# Chapter 7: Evaluating mapping accuracy

## Preparations

Load necessary packages.

In [None]:
using JudiLing, DataFrames, Plots

Load the Dutch dataset we will be working with.

In [None]:
# Adjust the filepath to the location of your dutch.csv file.
dutch = JudiLing.load_dataset("../dat/dutch.csv");
dutch = dutch[:,[:Ortho, :Word, :Number, :WordCat, :Lexeme, :Syllables, :Frequency]];

Generate `cue_obj` and `S` matrix:

In [None]:
cue_obj = JudiLing.make_cue_matrix(dutch,
                                   grams=3,
                                   target_col="Ortho");
S, words = JudiLing.load_S_matrix("../dat/dutch_w2v.csv"; header = false, sep = ",");

## Evaluating mapping accuracy

Compute mapping matrix F using the endstate of learning.

In [None]:
F = JudiLing.make_transform_matrix(cue_obj.C, S)
Shat = cue_obj.C * F
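
To make "endstate of learning" more concrete, here is a minimal sketch that assumes it corresponds to the least-squares solution of $\mathbf{C}\mathbf{F} \approx \mathbf{S}$. It uses only Julia's standard library and made-up toy matrices (`C_toy`, `S_toy` are hypothetical names); JudiLing's own implementation may use a different numerical method.

```julia
using LinearAlgebra

# Toy cue matrix: 4 word forms (rows) x 3 cues (columns); values are made up.
C_toy = [1.0 0.0 1.0;
         0.0 1.0 1.0;
         1.0 1.0 0.0;
         1.0 0.0 0.0]

# Toy semantic matrix: 4 words (rows) x 2 semantic dimensions; values are made up.
S_toy = [0.5 1.0;
         1.0 0.2;
         0.3 0.7;
         0.1 0.9]

# Least-squares solution of C_toy * F_toy ≈ S_toy via the pseudoinverse.
F_toy = pinv(C_toy) * S_toy
Shat_toy = C_toy * F_toy

# For a least-squares solution, the residual is orthogonal to the columns of C_toy,
# so this norm should be zero up to floating-point error.
residual_check = norm(C_toy' * (S_toy - Shat_toy))
println(residual_check)
```

Because there are more word forms than cues in this toy example, `Shat_toy` only approximates `S_toy`; the evaluation functions below quantify how good that approximation is.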

`eval_SC` calculates for each predicted vector the similarity with each vector in $\mathbf{S}$. If the most similar vector is the target vector, the mapping is counted as correct.

In [None]:
JudiLing.eval_SC(Shat, S)
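
The idea behind this evaluation can be sketched in a few lines of plain Julia. The example below is an illustration on made-up toy matrices, not JudiLing's implementation: it correlates every predicted row with every target row and counts a prediction as correct if its most correlated target sits in the same row.

```julia
using Statistics

# Toy target matrix: 3 words (rows) x 4 semantic dimensions; values are made up.
S_toy = [1.0 2.0 3.0 4.0;
         4.0 3.0 2.0 1.0;
         1.0 3.0 2.0 4.0]

# Toy predictions: rows 1 and 2 are close to their targets,
# row 3 is closer to the vector of word 1 than to its own target.
Shat_toy = [1.1 2.0 2.9 4.2;
            3.8 3.1 2.0 1.2;
            1.0 2.1 3.0 3.9]

# R[i, j] is the correlation between predicted row i and target row j.
R = cor(Shat_toy, S_toy; dims=2)

# For each predicted vector, find the index of the most correlated target vector.
predicted = [argmax(R[i, :]) for i in 1:size(R, 1)]

# A prediction is correct if the best match is the target in the same row.
accuracy = mean(predicted .== 1:size(R, 1))
println(predicted)
println(accuracy)
```

Here two of the three toy predictions are closest to their own targets, so the accuracy is 2/3.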

However, this strict evaluation might be unfair. Homographs have identical forms, so from the form alone it is impossible to know which of their meanings is the correct one (this is why there is a warning message). Consider the following example:

In [None]:
dutch[dutch.Ortho .== "missen",:]

From the form "missen" alone it is impossible to tell which of the two meanings is referred to. We might therefore want to count a mapping as correct if it is closest to either of the two meanings of "missen".

By additionally specifying the dataset and the column with wordforms, homographs are taken into account during evaluation:

In [None]:
JudiLing.eval_SC(Shat, S, dutch, :Ortho)
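
The difference between the two evaluation modes can be illustrated with a small sketch in plain Julia. It assumes that the lenient mode accepts a prediction whenever the best-matching row has the same word form as the target; the toy forms and matrices are made up, and this is not JudiLing's implementation.

```julia
using Statistics

# Toy data: rows 1 and 2 are homographs with different meanings.
forms = ["missen", "missen", "lopen"]
S_toy = [1.0 2.0 3.0 4.0;
         1.0 3.0 2.0 4.0;
         4.0 3.0 2.0 1.0]

# The prediction for row 1 is slightly closer to the meaning in row 2.
Shat_toy = [1.0 2.6 2.4 4.0;
            1.0 2.9 2.1 4.0;
            3.9 3.1 2.0 1.0]

# R[i, j] is the correlation between predicted row i and target row j.
R = cor(Shat_toy, S_toy; dims=2)
predicted = [argmax(R[i, :]) for i in 1:size(R, 1)]

# Strict: the best match must be the row of the target itself.
strict_acc = mean(predicted .== 1:length(forms))

# Lenient: the best match only needs to share the target's word form.
lenient_acc = mean(forms[predicted] .== forms)

println((strict_acc, lenient_acc))
```

In this toy case the first "missen" is mapped onto the meaning of the second one, which the strict mode counts as an error and the lenient mode accepts.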

In this case, accuracy has not improved, presumably because in the previous evaluation the target meaning of each homograph happened to be selected by chance. Our Dutch dataset also contains very few homographs. In languages such as German, however, which has many homographs across paradigm cells, lenient evaluation can be expected to yield substantial improvements.

We might also be interested in whether the target vector is among the top k most correlated. For this, we can use `eval_SC_loose`, specifying k to be 5 and again adding the dataset and the column with wordforms in order to take homographs into account.

In [None]:
JudiLing.eval_SC_loose(Shat, S, 5, dutch, :Ortho)
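
As an illustration of what "among the top k" means, the following sketch implements a simple top-k check in plain Julia on made-up toy matrices. The helper `topk_accuracy` is hypothetical and ignores homographs; it is not JudiLing's implementation.

```julia
using Statistics

# Hypothetical helper: proportion of rows whose target is among the
# k most correlated target vectors.
function topk_accuracy(Shat, S, k)
    R = cor(Shat, S; dims=2)   # R[i, j]: predicted row i vs. target row j
    hits = [i in sortperm(R[i, :]; rev=true)[1:k] for i in 1:size(R, 1)]
    return mean(hits)
end

# Toy target matrix and predictions; values are made up.
S_toy = [1.0 2.0 3.0 4.0;
         4.0 3.0 2.0 1.0;
         1.0 3.0 2.0 4.0]
Shat_toy = [1.1 2.0 2.9 4.2;
            3.8 3.1 2.0 1.2;
            1.0 2.1 3.0 3.9]

acc_top1 = topk_accuracy(Shat_toy, S_toy, 1)
acc_top2 = topk_accuracy(Shat_toy, S_toy, 2)
println((acc_top1, acc_top2))
```

The third toy prediction has its target only as the second-best match, so it is counted as wrong with k = 1 but as correct with k = 2.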

Finally, we can also compute token-based accuracy by supplying frequency counts for the word types:

In [None]:
JudiLing.eval_SC(Shat, S, dutch, :Ortho, freq=dutch.Frequency)
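
The contrast between type-based and token-based accuracy can be sketched as follows. The example assumes that token-based accuracy weights each word type by its frequency; the correctness flags and frequency counts are made up.

```julia
# Made-up evaluation result for four word types and their frequencies.
correct = [true, false, true, true]
freq = [100, 5, 20, 1]

# Type-based accuracy: every word type counts once.
type_acc = sum(correct) / length(correct)

# Token-based accuracy: every word type counts as often as it occurs.
token_acc = sum(freq[correct]) / sum(freq)

println((type_acc, token_acc))
```

Because the only error in this toy example concerns a low-frequency word, token-based accuracy (121/126) is higher than type-based accuracy (3/4).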

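We can also take frequency into account when estimating the mapping itself, using frequency-informed learning (FIL), and then compare type-based and token-based accuracy: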

In [None]:
F_fil = JudiLing.make_transform_matrix(cue_obj.C, S, dutch.Frequency)
Shat_fil = cue_obj.C * F_fil

@show JudiLing.eval_SC(Shat_fil, S, dutch, :Ortho)
@show JudiLing.eval_SC(Shat_fil, S, dutch, :Ortho, freq=dutch.Frequency)
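
To give an intuition for why a frequency-informed mapping can favour frequent words, the sketch below assumes that such a mapping behaves like frequency-weighted least squares, where each row is scaled by the square root of its relative frequency. This is an assumption made for illustration on made-up toy data, not a description of JudiLing's internals.

```julia
using LinearAlgebra

# Toy cue and semantic matrices (4 words); values are made up.
C_toy = [1.0 0.0 1.0;
         0.0 1.0 1.0;
         1.0 1.0 0.0;
         1.0 0.0 0.0]
S_toy = [0.5 1.0;
         1.0 0.2;
         0.3 0.7;
         0.1 0.9]

# Made-up frequencies: word 1 is much more frequent than the others.
freq = [50.0, 1.0, 1.0, 1.0]

# Unweighted least-squares mapping.
F_plain = pinv(C_toy) * S_toy

# Assumed frequency weighting: scale each row by sqrt of its relative frequency.
w = sqrt.(freq ./ sum(freq))
F_weighted = pinv(w .* C_toy) * (w .* S_toy)

# Prediction error for the frequent word (row 1) under both mappings.
err_plain = norm(S_toy[1, :] - (C_toy * F_plain)[1, :])
err_weighted = norm(S_toy[1, :] - (C_toy * F_weighted)[1, :])
println((err_plain, err_weighted))
```

In this toy example the weighted mapping predicts the frequent word more accurately than the unweighted one, at the expense of the rare words.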

We can follow the same procedure for production:

In [None]:
G = JudiLing.make_transform_matrix(S, cue_obj.C)
Chat = S * G

In [None]:
JudiLing.eval_SC(Chat, cue_obj.C)

However, this type of evaluation is not ideal for production, as no actual word forms have been produced. Details on how to do this can be found in the next notebook.

Finally, we might be interested in what kind of errors the model makes during comprehension. `JudiLing.accuracy_comprehension` is designed to give information about this.

In addition to specifying the semantic matrix, the predicted matrix, the dataset and the column with the target wordforms, we also specify the columns with the base and grammatical features:

In [None]:
acc = JudiLing.accuracy_comprehension(S, Shat, dutch,
                                        target_col=:Ortho,
                                        base=["Lexeme"],
                                        inflections=["Number", "WordCat"]);

To see what the generated `acc` object contains, we can consult the help pages:

In [None]:
?acc

`acc.acc` contains the mapping accuracy:

In [None]:
acc.acc

`acc.dfr` is a dataframe with one row per target word form. It contains the target form, the predicted form, the correlation between the predicted and the target semantics, and a column indicating whether the word form was comprehended correctly. For each of the semantic components (here: lexeme, number and word category), it also indicates whether that component was recognised correctly:

In [None]:
first(acc.dfr, 5)

Finally, `acc.err` contains a list of indices of all wordforms which were comprehended incorrectly:

In [None]:
first(acc.err, 5)

## Exercises

Preparation:

In [None]:
latin = JudiLing.load_dataset("../dat/latin.csv")

### Exercise 1

Comparing performance of various n-gram sizes:

In [None]:
S = JudiLing.make_S_matrix(
    latin,
    ["Lexeme"],
    ["Person", "Number", "Tense", "Voice", "Mood"],
    ncol=300)
JudiLing.display_matrix(latin, :Word, S, S, :S)

In [None]:
cue_obj2 = JudiLing.make_cue_matrix(latin, grams=2, target_col=:Word);
cue_obj3 = JudiLing.make_cue_matrix(latin, grams=3, target_col=:Word);

In [None]:
F2 = JudiLing.make_transform_matrix(cue_obj2.C, S)
F3 = JudiLing.make_transform_matrix(cue_obj3.C, S);

In [None]:
Shat2 = cue_obj2.C * F2
JudiLing.eval_SC(Shat2, S)

In [None]:
Shat3 = cue_obj3.C * F3
JudiLing.eval_SC(Shat3, S)

In [None]:
G2 = JudiLing.make_transform_matrix(S, cue_obj2.C)
G3 = JudiLing.make_transform_matrix(S, cue_obj3.C);

In [None]:
Chat2 = S * G2
JudiLing.eval_SC(Chat2, cue_obj2.C)

In [None]:
Chat3 = S * G3
JudiLing.eval_SC(Chat3, cue_obj3.C)

Accuracy is higher with trigrams than with bigrams.

### Exercise 2

Comparing different S matrix dimensionalities:

In [None]:
S50 = JudiLing.make_S_matrix(
    latin,
    ["Lexeme"],
    ["Person", "Number", "Tense", "Voice", "Mood"],
    ncol=50)

S300 = JudiLing.make_S_matrix(
    latin,
    ["Lexeme"],
    ["Person", "Number", "Tense", "Voice", "Mood"],
    ncol=300)

S1000 = JudiLing.make_S_matrix(
    latin,
    ["Lexeme"],
    ["Person", "Number", "Tense", "Voice", "Mood"],
    ncol=1000)

In [None]:
F50 = JudiLing.make_transform_matrix(cue_obj2.C, S50)
F300 = JudiLing.make_transform_matrix(cue_obj2.C, S300);
F1000 = JudiLing.make_transform_matrix(cue_obj2.C, S1000);

In [None]:
Shat50 = cue_obj2.C * F50
JudiLing.eval_SC(Shat50, S50)

In [None]:
Shat300 = cue_obj2.C * F300
JudiLing.eval_SC(Shat300, S300)

In [None]:
Shat1000 = cue_obj2.C * F1000
JudiLing.eval_SC(Shat1000, S1000)

For comprehension, semantic dimensionality has no clear influence on accuracy.

In [None]:
G50 = JudiLing.make_transform_matrix(S50, cue_obj2.C)
G300 = JudiLing.make_transform_matrix(S300, cue_obj2.C);
G1000 = JudiLing.make_transform_matrix(S1000, cue_obj2.C);

In [None]:
Chat50 = S50 * G50
JudiLing.eval_SC(Chat50, cue_obj2.C)

In [None]:
Chat300 = S300 * G300
JudiLing.eval_SC(Chat300, cue_obj2.C)

In [None]:
Chat1000 = S1000 * G1000
JudiLing.eval_SC(Chat1000, cue_obj2.C)

For production, a larger semantic dimensionality improves performance.

### Exercise 3

Comparing accuracy under strict evaluation with accuracy under evaluation that takes homographs into account, for mappings between a bigram cue matrix and a 300-dimensional semantic matrix:

In [None]:
Shat300 = cue_obj2.C * F300
# strict
JudiLing.eval_SC(Shat300, S300)

In [None]:
# taking into account homographs
JudiLing.eval_SC(Shat300, S300, latin, :Word)

In [None]:
Chat300 = S300 * G300
# strict
JudiLing.eval_SC(Chat300, cue_obj2.C)

In [None]:
# taking into account homographs
JudiLing.eval_SC(Chat300, cue_obj2.C, latin, :Word)

Contrary to intuition, the accuracy is higher for the strict evaluation than for the one taking into account homographs. To illustrate why this happens, let's inspect "terreereemus", which is counted as correct under "strict" evaluation and incorrect under the one taking into account homographs:

Let's first extract the correlation matrix. This is the same in both evaluation modes, so we can just use one of them to do so:

In [None]:
acc, R = JudiLing.eval_SC(Chat300, cue_obj2.C, R=true)

"terreereemus" is found in the following row:

In [None]:
latin[latin.Word .== "terreereemus",:]

So let's see what the correlations in row 448 are:

In [None]:
R[448,:]

With which word is the predicted form vector of "terreereemus" correlated the most?

In [None]:
argmax(R[448,:])

With the word in row 28. Let's see which word that is:

In [None]:
latin[28,:]

"terreemus" != "terreereemus", and therefore, when we take into account the wordforms in the dataframe, this is counted as incorrect. Why is it counted as correct in the strict evaluation then?

Let's take a look at the actual correlation value of "terreereemus" with row 28:

In [None]:
R[448,28]

"strict" evaluation counts a mapping as correct if the highest correlation is the same as the correlation on the diagonal of the correlation matrix, assuming that prediction and target are in the corresponding rows in the predicted and the target matrices. Let's see what the correlation is of row 448 in the predicted matrix with row 448 in the target matrix:

In [None]:
R[448,448]

It's the same value as with row 28! The maximum correlation for this row therefore equals the one on the diagonal, and the mapping is counted as correct under strict evaluation. In the evaluation taking into account homographs, on the other hand, the row that is picked as having the maximum correlation is not necessarily the one on the diagonal. Its word form is then compared with the target form, and the mapping may be counted as incorrect, which is what happens here.
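
The effect of such a tie can be reproduced with a tiny sketch in plain Julia. The correlation values are made up, and the two checks only mirror the two behaviours described above; they are not JudiLing's code.

```julia
# One made-up row of a correlation matrix: the target (column 3) ties with column 1.
r = [0.9, 0.2, 0.9]
target = 3

# Check 1: is the maximum correlation equal to the correlation with the target?
max_equals_target = maximum(r) == r[target]

# Check 2: is the position of the maximum the target position?
# argmax returns the first position of the maximum when there are ties.
argmax_is_target = argmax(r) == target

println((max_equals_target, argmax_is_target))
```

The first check succeeds and the second fails, because `argmax` returns the first of the tied positions.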

### Exercise 4

Compute loose evaluation with various values of k:

In [None]:
Shat300 = cue_obj2.C * F300
JudiLing.eval_SC(Shat300, S300)

In [None]:
JudiLing.eval_SC_loose(Shat300, S300, 5)

In [None]:
JudiLing.eval_SC_loose(Shat300, S300, 10)

In [None]:
Chat300 = S300 * G300
JudiLing.eval_SC(Chat300, cue_obj2.C)

In [None]:
JudiLing.eval_SC_loose(Chat300, cue_obj2.C, 5)

In [None]:
JudiLing.eval_SC_loose(Chat300, cue_obj2.C, 10)

Accuracy increases with higher k.

### Exercise 5

Compute token-based accuracy:

In [None]:
JudiLing.eval_SC(Shat300, S300, latin, :Word, freq=latin.sim_freq)

In [None]:
JudiLing.eval_SC(Chat300, cue_obj2.C, latin, :Word, freq=latin.sim_freq)

Accuracy stays very similar under token-based evaluation compared to type-based evaluation.

### Exercise 6

Using `accuracy_comprehension` to inspect which words are not understood correctly in a mapping from a bigram cue matrix to a 300-dimensional semantic matrix. 

In [None]:
acc = JudiLing.accuracy_comprehension(S300, Shat300, latin,
                                        target_col=:Word,
                                        base=["Lexeme"],
                                        inflections=["Person", "Number", "Tense", "Voice", "Mood"]);

In [None]:
dfr = acc.dfr

Which inflectional feature does the mapping struggle with the most?

In [None]:
sum.(eachcol(dfr[:,["Lexeme", "Person", "Number", "Tense", "Voice", "Mood"]]))

It makes the most mistakes for "Mood" (note that this may look different for you, since the simulated semantic matrix introduces randomness into the process).
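
Note that summing a column of correctness flags gives the number of correctly recognised words, so the feature with the lowest sum is the one with the most mistakes. The sketch below shows the same logic on made-up flags stored in a plain NamedTuple (a stand-in for the dataframe columns), counting errors directly.

```julia
# Made-up correctness flags for three features and four words.
cols = (Lexeme = [true, true, true, false],
        Number = [true, false, true, true],
        Mood   = [false, false, true, true])

# Number of correctly recognised words per feature.
n_correct = map(sum, cols)

# Number of errors per feature: count the entries that are false.
n_errors = map(c -> count(!, c), cols)

# Feature with the most errors.
worst = keys(cols)[argmax(collect(values(n_errors)))]

println(n_correct)
println(n_errors)
println(worst)
```

In this toy example `Mood` has two errors and is therefore the feature the mapping struggles with the most.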