# Chapter 12.10: Dutch devoicing (Solutions for Exercise 2)

## Preparations

Load the necessary packages. In addition to the usual packages, this part requires two more: `StatsBase`, which provides basic statistical functions (see the [documentation](https://juliastats.org/StatsBase.jl/stable/)), and `Random`, a Julia standard library for random number generation (see the [documentation](https://docs.julialang.org/en/v1/stdlib/Random/)). We also load the standard library `LinearAlgebra`, which provides `diag` for the correlation analysis further down.

In [None]:
using JudiLing, DataFrames
using StatsBase, Random, LinearAlgebra

Load the usual Dutch dataset.

In [None]:
# Adjust the filepath to the location of your dutch.csv file.
dutch = JudiLing.load_dataset("../dat/dutch.csv");
first(dutch, 5)

## Performance on all data

<font color='red'>Here we load fasttext vectors</font>

In [None]:
size(dutch)

In [None]:
dutch, S = JudiLing.load_S_matrix_from_fasttext(dutch, :nl, target_col = "Ortho");

In [None]:
size(dutch)

<font color='red'>NOTE: the loading step drops words for which no fasttext vector is available, so this dataset contains 15 fewer words than with word2vec</font>

Create a cue object. We are interested in the pronunciation of the wordforms and therefore use the `"Word"` column.

In [None]:
cue_obj = JudiLing.make_cue_matrix(dutch, grams=3, target_col="Word");

Train comprehension matrix and predict semantic matrix.

In [None]:
F = JudiLing.make_transform_matrix(cue_obj.C, S);
Shat = cue_obj.C * F;
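
Conceptually, `make_transform_matrix` estimates a linear mapping from the cue matrix to the semantic matrix. The following toy sketch illustrates the idea with a plain least-squares solution computed via the pseudoinverse. The matrices `C_toy` and `S_toy` are made up for illustration, and JudiLing's actual solver may differ (for instance, it may be regularised), so this is a sketch of the principle rather than of the library's implementation.

```julia
using LinearAlgebra

# Toy cue matrix: 4 word forms x 3 cues (hypothetical values)
C_toy = [1.0 0.0 1.0;
         0.0 1.0 1.0;
         1.0 1.0 0.0;
         1.0 0.0 0.0]
# Toy semantic matrix: 4 word forms x 2 semantic dimensions (hypothetical values)
S_toy = [0.2 0.9;
         0.7 0.1;
         0.5 0.5;
         0.1 0.3]

# Least-squares mapping F_toy minimising the squared error of C_toy * F_toy - S_toy
F_toy = pinv(C_toy) * S_toy
Shat_toy = C_toy * F_toy

# For a least-squares fit, the residual is orthogonal to the columns of C_toy
residual = S_toy - Shat_toy
println(norm(C_toy' * residual))  # close to zero, up to floating-point error
```

With four word forms but only three cues, the mapping cannot reproduce `S_toy` exactly, which is why predicted and target vectors have to be compared by correlation in the next step.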

Compute comprehension accuracy:

In [None]:
c_acc = JudiLing.eval_SC(Shat, S)
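
The idea behind this correlation-based accuracy can be sketched in a few lines: a predicted vector counts as correct if, among all target vectors, it correlates most strongly with its own target. The helper `toy_eval_SC` and the toy matrices below are hypothetical and only illustrate the principle; `JudiLing.eval_SC` offers further options (such as the variants with a data frame and target column used later in this notebook) that are not modelled here.

```julia
using Statistics

# Hypothetical helper: proportion of rows of Shat whose most strongly
# correlated row in S is the row with the same index.
function toy_eval_SC(Shat::AbstractMatrix, S::AbstractMatrix)
    R = cor(Shat, S, dims=2)   # R[i, j] = correlation of Shat[i, :] and S[j, :]
    hits = [argmax(R[i, :]) == i for i in 1:size(R, 1)]
    return mean(hits)
end

# Toy target vectors (3 words x 4 semantic dimensions)
S_toy = [1.0 0.0 0.0 0.5;
         0.0 1.0 0.0 0.5;
         0.0 0.0 1.0 0.5]
# Toy predictions: rows 1 and 2 resemble their targets, row 3 resembles target 1
Shat_toy = [0.9 0.1 0.0 0.5;
            0.1 0.8 0.1 0.5;
            0.9 0.0 0.2 0.5]

acc = toy_eval_SC(Shat_toy, S_toy)
println(acc)
```

In this toy example two of the three predictions are closest to their own target, so the accuracy is 2/3.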

Train production matrix and predict cue matrix:

In [None]:
G = JudiLing.make_transform_matrix(S, cue_obj.C);
Chat = S * G;

Compute produced forms:

In [None]:
prod_res = JudiLing.learn_paths(dutch, cue_obj, S, F, Chat, threshold=0.01);

Get production accuracy:

In [None]:
p_acc = JudiLing.eval_acc(prod_res, cue_obj)

In [None]:
JudiLing.write2csv(prod_res, dutch, cue_obj, cue_obj, "../res/prod_ft.csv",
                   target_col=:Word);

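We now repeat the analysis with frequency-informed learning, passing the word frequencies as a third argument to `make_transform_matrix`: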

In [None]:
F_fil = JudiLing.make_transform_matrix(cue_obj.C, S, dutch.Frequency);
Shat_fil = cue_obj.C * F_fil;

c_acc_fil = JudiLing.eval_SC(Shat_fil, S)
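
One way to think about a frequency-sensitive mapping is as a weighted least-squares problem in which frequent word forms contribute more to the error that is minimised. The sketch below assumes this reading; the toy matrices and frequencies are hypothetical, and the exact computation inside JudiLing may differ.

```julia
using LinearAlgebra

# Toy data: 3 word forms x 2 cues, one semantic dimension (hypothetical values)
C_toy = [1.0 0.0;
         1.0 1.0;
         0.0 1.0]
S_toy = reshape([1.0, 0.0, 1.0], 3, 1)
freq_toy = [100.0, 1.0, 1.0]   # hypothetical token frequencies

# Unweighted least squares: every word form counts equally
F_unweighted = pinv(C_toy) * S_toy

# Weighted least squares: scale each row by the square root of its relative frequency
w = sqrt.(freq_toy ./ sum(freq_toy))
F_weighted = pinv(w .* C_toy) * (w .* S_toy)

# Absolute prediction error per word form under both mappings
err_unweighted = abs.(C_toy * F_unweighted - S_toy)
err_weighted = abs.(C_toy * F_weighted - S_toy)
println(err_unweighted[1], " vs. ", err_weighted[1])
```

The high-frequency form in the first row is fitted more closely by the weighted mapping, at the cost of the two low-frequency forms. This trade-off is one plausible reason why accuracy counted over word types can be much lower with frequency-informed learning.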

In [None]:
G_fil = JudiLing.make_transform_matrix(S, cue_obj.C, dutch.Frequency);
Chat_fil = S * G_fil;

prod_res_fil = JudiLing.learn_paths(dutch, cue_obj, S, F_fil, Chat_fil, threshold=0.01);

p_acc_fil = JudiLing.eval_acc(prod_res_fil, cue_obj)

## Performance on held-out plural forms

First, we need to create a split where the test data only contains plural forms. Unfortunately, this is a task beyond the capabilities of the `loading_data_careful_split` function in JudiLing, so we need to make the split manually:

In [None]:
describe(dutch)

In [None]:
# create a list of the rownumbers in the full dutch dataset
rownumbers = 1:nrow(dutch)
# of these, select only those which correspond to plural forms
plural_rownumbers = rownumbers[dutch.Number .== "plural"]
# filter plural rownumbers to only include those where the singular is available in the data
plural_rownumbers = [ro for ro in plural_rownumbers 
        if dutch[ro, :Lexeme] in dutch[dutch.Number .== "singular", :Lexeme]]
# sample 100 plural rows. To make this reproducible, first set a random seed.
Random.seed!(3)
test_rows = sample(plural_rownumbers,100, replace=false)
# the remaining row numbers should go into the training set
train_rows = setdiff(rownumbers, test_rows)

# create final training and test sets
dutch_train = dutch[train_rows,:]
dutch_test = dutch[test_rows,:]
dutch_train_test = vcat(dutch_train, dutch_test)
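
The logic of this manual split can be checked on a miniature example. The sketch below uses plain vectors instead of a data frame and only the `Random` standard library; the word list is hypothetical. It collects the singular lexemes into a `Set` once, which avoids recomputing the list of singulars for every row.

```julia
using Random

# Hypothetical mini-dataset: one entry per word form
lexemes = ["hand", "hand", "boek", "boek", "kat", "huis"]
numbers = ["singular", "plural", "singular", "plural", "plural", "singular"]

rows = 1:length(lexemes)
singular_lexemes = Set(lexemes[numbers .== "singular"])

# Plural rows whose singular is also attested in the data
candidates = [r for r in rows
              if numbers[r] == "plural" && lexemes[r] in singular_lexemes]

# Hold out one plural; seed the generator for reproducibility
Random.seed!(3)
test_rows = shuffle(candidates)[1:1]
train_rows = setdiff(rows, test_rows)

# Sanity checks: test items are plurals, their singulars remain in the
# training data, and the two sets do not overlap
train_singulars = lexemes[train_rows][numbers[train_rows] .== "singular"]
@assert all(numbers[test_rows] .== "plural")
@assert all(l -> l in train_singulars, lexemes[test_rows])
@assert isempty(intersect(train_rows, test_rows))
```

The same three checks can be applied to `test_rows` and `train_rows` from the cell above to confirm that the real split behaves as intended.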

Create cue objects.

In [None]:
# create one cue object for the training and test data combined (will be needed below)
cue_obj_train_test = JudiLing.make_cue_matrix(dutch_train_test, grams=3, target_col="Word");

# create cue objects for the training and test data respectively
cue_obj_train, cue_obj_test = JudiLing.make_combined_cue_matrix(dutch_train, dutch_test, grams=3, target_col="Word");

Split the S matrix into a training and a test part, and build a combined S matrix from the two.

In [None]:
S_test = S[test_rows,:]
S_train = S[train_rows,:]

# we need to create this combined S matrix to make sure that the rows here are 
# in the same order as in dutch_train_test and cue_obj_train_test
S_train_test = vcat(S_train, S_test);

Create one F matrix based on the train AND test data, and one based only on the train data.

In [None]:
F_train_test = JudiLing.make_transform_matrix(cue_obj_train_test.C, S_train_test);
F_train = JudiLing.make_transform_matrix(cue_obj_train.C, S_train);

Inspect the accuracy of the F matrix trained only on the training data:

In [None]:
Shat_train = cue_obj_train.C * F_train
@show JudiLing.eval_SC(Shat_train, S_train, dutch_train, "Word")
Shat_test = cue_obj_test.C * F_train
@show JudiLing.eval_SC(Shat_test, S_test, S_train, dutch_test, dutch_train, "Word")

Create a G matrix based on the training data and predict the form matrix for both the training and test data.

In [None]:
G_train = JudiLing.make_transform_matrix(S_train, cue_obj_train.C);
Chat_train = S_train * G_train;
Chat_test = S_test * G_train;

Calculate the maximum number of production steps (timesteps) needed for the word forms in the training and test data.

In [None]:
max_t = JudiLing.cal_max_timestep(dutch_train, dutch_test, "Word")

Produce forms based on the F matrix trained on both the training and test data.

In [None]:
prod_test = JudiLing.learn_paths(dutch_train,
                                 dutch_test, 
                                 cue_obj_train.C, 
                                 S_test,
                                 F_train_test, # set F matrix trained on train and test data here
                                 Chat_test, 
                                 cue_obj_test.A, 
                                 cue_obj_train.i2f, 
                                 cue_obj_train.f2i,
                                 max_t=max_t, # value computed with cal_max_timestep above
                                 threshold=0.001, 
                                 grams=3,
                                 is_tolerant = true, 
                                 tolerance = -0.1, 
                                 max_tolerance = 2, 
                                 target_col="Word");
JudiLing.eval_acc(prod_test, cue_obj_test)

Produce forms based on the F matrix trained on the training data only.

In [None]:
prod_test2 = JudiLing.learn_paths(dutch_train,
                                  dutch_test, 
                                  cue_obj_train.C, 
                                  S_test,
                                  F_train, # set F matrix trained on train data only here
                                  Chat_test, 
                                  cue_obj_test.A, 
                                  cue_obj_train.i2f, 
                                  cue_obj_train.f2i,
                                  max_t=max_t, # value computed with cal_max_timestep above
                                  threshold=0.001, 
                                  grams=3,
                                  is_tolerant = true, 
                                  tolerance = -0.1, 
                                  max_tolerance = 2, 
                                  target_col="Word");

In [None]:
JudiLing.eval_acc(prod_test2, cue_obj_test)

Compute the accuracy when the target form only has to be among the top 10 candidates:

In [None]:
JudiLing.eval_acc_loose(prod_test2, cue_obj_test.gold_ind)
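
The difference between strict and lenient evaluation can be illustrated without JudiLing. In the sketch below, each word has a list of candidate forms ordered from best to worst supported; the candidate lists and gold forms are hypothetical. Strict accuracy only accepts the top candidate, whereas lenient accuracy accepts the gold form anywhere in the list.

```julia
using Statistics

# Hypothetical candidate lists per word, best candidate first
candidates = [["handen", "hanten"],
              ["boeken"],
              ["katen", "katten"],
              ["huizen", "huisen"]]
# Hypothetical gold-standard forms
gold = ["handen", "boeken", "katten", "huiden"]

# Strict: the best candidate must equal the gold form
strict_acc = mean(first(c) == g for (c, g) in zip(candidates, gold))
# Lenient: the gold form must occur somewhere among the candidates
loose_acc = mean(g in c for (c, g) in zip(candidates, gold))

println("strict: ", strict_acc, ", lenient: ", loose_acc)
```

A large gap between the two numbers indicates that the correct form is often generated but not ranked first.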

## Inspecting shift vectors

This part of the analysis does not require any knowledge of JudiLing. We nevertheless provide the code for the sake of completeness.

First, we turn the Dutch dataset into a wide dataset in which each row provides the singular and plural form of one lexeme. We drop rows where the singular or plural form is missing.

In [None]:
dutch_wide = unstack(dutch, [:Lexeme, :WordCat, :Voice], :Number, :Ortho, allowduplicates=true)
dutch_wide = dropmissing(dutch_wide)

Next, we split up the S matrix into one matrix with the semantic vectors of all singulars and one with those of all plurals. We make sure that the rows of both matrices are ordered according to `dutch_wide`, so that the i-th row of `S_singular` holds the semantic vector of the singular form whose plural form's semantic vector is in the i-th row of `S_plural`.

In [None]:
singular_rownumbers = [findfirst(==(w), dutch.Ortho) for w in dutch_wide.singular]
plural_rownumbers = [findfirst(==(w), dutch.Ortho) for w in dutch_wide.plural]

S_singular = S[singular_rownumbers,:]
S_plural = S[plural_rownumbers,:]

Now we can calculate the correlation between the singular and plural semantic vectors of each pair, and obtain the range as well as the median of these correlations.

In [None]:
cors = diag(cor(S_singular, S_plural, dims = 2))
println("min: ", findmin(cors), ", max: ", findmax(cors), ", median: ", median(cors))
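
The cell above builds the full matrix of correlations between all singular and all plural vectors and then keeps only its diagonal. For larger datasets it can be cheaper to correlate matching rows directly. The sketch below shows, on hypothetical random data, that both approaches yield the same values; `A` and `B` stand in for `S_singular` and `S_plural`.

```julia
using Statistics, LinearAlgebra, Random

# Hypothetical stand-ins for S_singular and S_plural (5 pairs, 20 dimensions)
Random.seed!(1)
A = randn(5, 20)
B = A .+ 0.5 .* randn(5, 20)

# Approach used above: full 5 x 5 correlation matrix, then its diagonal
cors_full = diag(cor(A, B, dims=2))

# Alternative: correlate only the matching rows
cors_pair = [cor(A[i, :], B[i, :]) for i in 1:size(A, 1)]

# Range and median of the pairwise correlations
@show extrema(cors_pair) median(cors_pair)
```

The alternative avoids allocating an n x n matrix, which matters once n reaches several thousand word pairs.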

The shift vectors are calculated by subtracting the singular vectors from the plural vectors.

In [None]:
shift_vectors = S_plural .- S_singular

### t-SNE

Next, we require the `TSne` and `Plots` libraries. If you have not installed them yet, you can do so with `using Pkg; Pkg.add(["TSne", "Plots"])` before loading them:

In [None]:
using TSne, Plots

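We now run t-SNE on the shift vectors calculated above. The positional arguments request 2 output dimensions, an initial reduction to 50 dimensions, 1000 iterations, and a perplexity of 30. Note that a warning will show up; it can be safely ignored.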

In [None]:
Random.seed!(3)
Y = tsne(shift_vectors, 2, 50, 1000, 30.0);

...and we can plot the result.

In [None]:
markercolors = [colorant"#FE8892",
                  colorant"#8FAADC",
                  colorant"#FE8892",
                  colorant"#8FAADC"]
labels = ["Noun non-alt.", "Verb non-alt.", "Noun alternating", "Verb alternating"]
markershapes = [:star4, :star4, :circle, :circle]
dutch_wide[!,"voice_wordcat"] = string.(dutch_wide.Voice, "_", dutch_wide.WordCat)
p = scatter(xlab="tSNE dimension 1", ylab="tSNE dimension 2", title="Fasttext vectors")
for (i, comb) in enumerate(["voiceless_noun", "voiceless_verb", "voiced_noun", "voiced_verb"])
    # select the rows belonging to the current voice/word category combination
    mask = dutch_wide.voice_wordcat .== comb
    scatter!(Y[mask, 1], Y[mask, 2],
             markershape = markershapes[i],
             markercolor = markercolors[i],
             markersize = 3.5,
             markerstrokewidth = 0.3,
             label = labels[i])
end
p

In [None]:
savefig("../fig/tsne_shift_vectors_ft.pdf")

### Linear Discriminant Analysis

As there is so far no good library for performing LDA for classification in Julia, we now move to R. Luckily, we can do this directly from Julia by making use of the `RCall` library. If it is not installed yet, you can install it with `using Pkg; Pkg.add("RCall")` before loading it:

In [None]:
using RCall

Here, we transfer the shift vectors as well as the `dutch_wide` dataset to R.

In [None]:
@rput shift_vectors
@rput dutch_wide

We load the MASS library in R (after installing it, if necessary):

In [None]:
R"""
# install.packages("MASS")

library(MASS)
"""

Now we run the LDA on the shift vectors and extract the prior probabilities:

In [None]:
R"""
ld = lda(x = shift_vectors, grouping = dutch_wide$Voice)
ld$prior
"""

And predict and evaluate the results:

In [None]:
R"""
pred = predict(ld, shift_vectors)$class

mean(pred == dutch_wide$Voice)
"""

In [None]:
R"""

table(dutch_wide$Voice)/length(dutch_wide$Voice)
"""

## Conclusion

Overall, we find that accuracies are similar with word2vec and fasttext:

Performance on all data (EL: endstate learning; FIL: frequency-informed learning):
- w2v EL: comprehension: 90.3%, production: 98.0%
- ft EL: comprehension: 83.9%, production: 98.0%
- w2v FIL: comprehension: 23.3%, production: 39.0%
- ft FIL: comprehension: 24.6%, production: 35.5%

Performance on held-out data (using the production model where both training and testing data are available for training the comprehension model):
- w2v EL: comprehension: 0.02%, production: 88%
- ft EL: comprehension: 0.01%, production: 96%

LDA classification accuracies:
- w2v: 79.5%
- ft: 77.9%

Our findings therefore hold for both word2vec and fasttext vectors. Visual inspection of the tSNE plots for fasttext and word2vec suggests that there are no large differences in terms of clustering either.