# Chapter 12.7: Estonian nouns

## Preparation

### Exercise 1

In [None]:
using JudiLing, DataFrames

Load the Estonian dataset from the `dat` directory.

In [None]:
estonian = JudiLing.load_dataset("../dat/estonian.csv")
first(estonian, 5)

In [None]:
size(estonian)

In [None]:
names(estonian)

### Exercise 2

Load word embeddings directly from fastText. When downloading the vectors for the first time, you need to agree to the download under the provided license: enter `y` in the `stdin` input field and press Enter.

The function also returns a subset of the provided dataframe that contains only the word forms for which an embedding is available. This new dataset will therefore most likely be smaller than the original one.

In [None]:
estonian, S = JudiLing.load_S_matrix_from_fasttext(estonian, :et, target_col=:Word)
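
The subsetting step can be pictured with a toy example. The sketch below is only an illustration with made-up words and two-dimensional vectors (`words`, `embeddings` and `S_toy` are hypothetical names); it is not how `load_S_matrix_from_fasttext` is implemented.

```julia
# Toy illustration only: hypothetical words and 2-dimensional "embeddings"
words = ["maja", "majad", "xyzq"]
embeddings = Dict("maja" => [0.1, 0.3], "majad" => [0.2, 0.1])

# Keep only the words for which a vector is available
keep = [haskey(embeddings, w) for w in words]
words_kept = words[keep]

# Stack the vectors row-wise: one row per retained word
S_toy = permutedims(reduce(hcat, [embeddings[w] for w in words_kept]))
size(S_toy)
```

The word without a vector is dropped, so the data and the semantic matrix keep the same number of rows.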

In [None]:
size(estonian)

About 2000 words were excluded.

## Analysis 1: Learnability of the full system

### Exercise 3

Create a cue object using trigrams.

In [None]:
cue_obj = JudiLing.make_cue_matrix(estonian, grams=3, target_col="Word")
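
To see what a trigram cue is, here is a minimal sketch of trigram extraction. The helper `toy_trigrams` is hypothetical and written only for illustration, and the use of `#` as a word-boundary marker is an assumption of this sketch.

```julia
# Hypothetical helper: split a word into overlapping trigrams,
# with "#" marking the word boundaries (an assumption of this sketch)
function toy_trigrams(word; boundary="#")
    chars = collect(boundary * word * boundary)  # collect handles non-ASCII letters
    return [join(chars[i:i+2]) for i in 1:length(chars)-2]
end

toy_trigrams("maja")
```

Each row of a cue matrix then marks which of these trigrams occur in the corresponding word form.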

### Exercise 4

Compute the $\mathbf{F}$ and $\mathbf{G}$ matrices.

In [None]:
F = JudiLing.make_transform_matrix(cue_obj.C, S)
G = JudiLing.make_transform_matrix(S, cue_obj.C)

... and the $\hat{\mathbf{S}}$ and $\hat{\mathbf{C}}$ matrices...

In [None]:
Shat = cue_obj.C * F
Chat = S * G
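
Conceptually, $\mathbf{F}$ and $\mathbf{G}$ are linear mappings chosen so that $\mathbf{C}\mathbf{F}$ approximates $\mathbf{S}$ and $\mathbf{S}\mathbf{G}$ approximates $\mathbf{C}$. The sketch below illustrates this on tiny made-up matrices, using the Moore-Penrose pseudoinverse from `LinearAlgebra` as a stand-in. This is an assumption made for illustration and not necessarily the solver used inside `make_transform_matrix`.

```julia
using LinearAlgebra

# Toy data: 3 words, 3 cues, 2 semantic dimensions (all values made up)
C_toy = [1.0 0.0 1.0;
         0.0 1.0 1.0;
         1.0 1.0 0.0]
S_toy = [ 0.5 -0.2;
          0.1  0.9;
         -0.3  0.4]

# Least-squares mappings via the pseudoinverse (stand-in solver)
F_toy = pinv(C_toy) * S_toy   # form -> meaning
G_toy = pinv(S_toy) * C_toy   # meaning -> form

# Predicted semantic and form matrices
Shat_toy = C_toy * F_toy
Chat_toy = S_toy * G_toy
```

Because `C_toy` is invertible, `Shat_toy` reproduces `S_toy` up to rounding error, whereas `Chat_toy` is only an approximation of `C_toy`.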

### Exercise 5

Evaluate comprehension accuracy.

In [None]:
JudiLing.eval_SC(Shat, S, estonian, "Word")
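
The idea behind this kind of evaluation can be sketched as follows: a word counts as correctly understood if its predicted semantic vector correlates more strongly with its own gold vector than with any other. The function `toy_eval` below is a hypothetical, simplified stand-in for illustration. It ignores details such as homophones and is not the code of `eval_SC`.

```julia
using Statistics

# Hypothetical helper: share of rows whose prediction correlates best
# with the gold vector of the same row
function toy_eval(Shat, S)
    R = cor(Shat, S; dims=2)  # R[i, j]: predicted row i vs. gold row j
    return mean(argmax(R[i, :]) == i for i in 1:size(S, 1))
end

# Made-up gold vectors and slightly perturbed predictions
S_toy = [1.0 0.0 0.0 0.5;
         0.0 1.0 0.0 0.5;
         0.0 0.0 1.0 0.5]
Shat_toy = S_toy .+ 0.1 .* [ 0.1 -0.2  0.3 0.0;
                            -0.1  0.2  0.0 0.1;
                             0.2  0.0 -0.1 0.1]

toy_eval(Shat_toy, S_toy)
```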

### Exercise 6

Apply the production algorithm with threshold 0.01:

In [None]:
res_learn = JudiLing.learn_paths(estonian, cue_obj, S, F, Chat, threshold=0.01)

... and evaluate

In [None]:
JudiLing.eval_acc(res_learn, cue_obj)

## Analysis 2: Performance on unseen words

### Exercise 7

First, split the dataset into a training set and a test set. The training set contains all word forms that occur in the corpus, and the test set contains all forms that do not.

In [None]:
estonian_train = estonian[estonian.inCorpus .== true,:]
estonian_test = estonian[estonian.inCorpus .== false,:]

In [None]:
size(estonian_test)

Create cue objects for both sets using the `make_combined_cue_matrix` function:

In [None]:
cue_obj_train, cue_obj_test = JudiLing.make_combined_cue_matrix(estonian_train, estonian_test, grams=3, target_col="Word")
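
The point of building both cue objects together is that the training and test cue matrices share the same columns, so a mapping estimated on the training data can be applied to the test data. A minimal sketch of that idea, with hypothetical helpers (`ngrams`, `cue_matrix`) and made-up word lists:

```julia
# Hypothetical helper: trigrams with "#" as boundary marker
function ngrams(word; n=3)
    chars = collect("#" * word * "#")
    return [join(chars[i:i+n-1]) for i in 1:length(chars)-n+1]
end

train_words = ["maja", "majad"]
test_words = ["majas"]

# One shared cue-to-column mapping, built from both sets
all_cues = unique(vcat(ngrams.(vcat(train_words, test_words))...))
f2i = Dict(cue => i for (i, cue) in enumerate(all_cues))

# Hypothetical helper: binary word-by-cue matrix using a given mapping
function cue_matrix(words, f2i)
    C = zeros(Int, length(words), length(f2i))
    for (row, w) in enumerate(words), cue in ngrams(w)
        C[row, f2i[cue]] = 1
    end
    return C
end

C_train_toy = cue_matrix(train_words, f2i)
C_test_toy = cue_matrix(test_words, f2i)
size(C_train_toy, 2) == size(C_test_toy, 2)
```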

Subset the S matrix for the training and test set:

In [None]:
S_train = S[estonian.inCorpus .== true,:]
S_test = S[estonian.inCorpus .== false,:]
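
The rows of `S` are aligned with the rows of the dataframe, which is why the same Boolean mask is applied to both. A toy sketch with made-up values (all names are hypothetical):

```julia
# Made-up data: four words, each with a 2-dimensional semantic vector
in_corpus = [true, false, true, true]
words = ["a", "b", "c", "d"]
S_toy = [1.0 2.0; 3.0 4.0; 5.0 6.0; 7.0 8.0]

# The same mask selects matching rows in the word list and in the matrix
words_train, S_train_toy = words[in_corpus], S_toy[in_corpus, :]
words_test, S_test_toy = words[.!in_corpus], S_toy[.!in_corpus, :]
```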

Compute the $\mathbf{F}$ and $\mathbf{G}$ matrices.

In [None]:
F_train = JudiLing.make_transform_matrix(cue_obj_train.C, S_train)
G_train = JudiLing.make_transform_matrix(S_train, cue_obj_train.C)

... and the $\hat{\mathbf{S}}$ and $\hat{\mathbf{C}}$ matrices for the training set...

In [None]:
Shat_train = cue_obj_train.C * F_train
Chat_train = S_train * G_train

Evaluate comprehension accuracy on the training data:

In [None]:
JudiLing.eval_SC(Shat_train, S_train)

Apply the production algorithm to the training data...

In [None]:
res_learn_train = JudiLing.learn_paths(estonian_train, 
                                        cue_obj_train, S_train, F_train, Chat_train, threshold=0.01)

...and compute the accuracy:

In [None]:
JudiLing.eval_acc(res_learn_train, cue_obj_train)

Now moving on to the unseen data, calculate the $\hat{\mathbf{S}}$ and $\hat{\mathbf{C}}$ matrices for the test set, using the mappings learned on the training set...

In [None]:
Shat_test = cue_obj_test.C * F_train
Chat_test = S_test * G_train
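
Written out, the held-out predictions reuse the mappings estimated on the training data, and no mapping is estimated on the test set itself. The subscripts below are notation introduced here for clarity:

$$\hat{\mathbf{S}}_{\text{test}} = \mathbf{C}_{\text{test}}\,\mathbf{F}_{\text{train}}, \qquad \hat{\mathbf{C}}_{\text{test}} = \mathbf{S}_{\text{test}}\,\mathbf{G}_{\text{train}}$$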

...and evaluate without taking the training data into account...

In [None]:
JudiLing.eval_SC(Shat_test, S_test)

... and with the training data taken into account:

In [None]:
JudiLing.eval_SC(Shat_test, S_test, S_train)
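
Taking the training data into account makes the task harder, because the predicted vector of a test word now also has to beat the vectors of all training words. The sketch below illustrates this with a hypothetical helper `toy_eval_with_rest` and made-up vectors. It is a simplified illustration, not the code of `eval_SC`.

```julia
using Statistics

# Hypothetical helper: the gold vectors S compete with additional vectors S_rest
function toy_eval_with_rest(Shat, S, S_rest)
    S_all = vcat(S, S_rest)       # targets first, then the competitors
    R = cor(Shat, S_all; dims=2)  # predicted row i vs. every candidate row
    return mean(argmax(R[i, :]) == i for i in 1:size(S, 1))
end

S_toy = [1.0 0.0 0.0 0.5;
         0.0 1.0 0.0 0.5]
S_rest_toy = [0.9 0.1 0.0 0.6]    # a "training" vector close to the first word
Shat_toy = [0.9 0.1 0.0 0.6;
            0.0 1.0 0.0 0.5]

acc_without = toy_eval_with_rest(Shat_toy, S_toy, zeros(0, 4))
acc_with = toy_eval_with_rest(Shat_toy, S_toy, S_rest_toy)
```

In this toy case the first prediction is closest to the competitor from the "training" set, so accuracy drops once that competitor is included.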

Apply the production algorithm to the test data:

In [None]:
res_learn_test = JudiLing.learn_paths(
    estonian_train,
    estonian_test,
    cue_obj_train.C,
    S_test,
    F_train,
    Chat_test,
    cue_obj_train.A,
    cue_obj_train.i2f,
    cue_obj_train.f2i, # api changed in 0.3.1
    gold_ind = cue_obj_test.gold_ind, # gold paths of the test words (only used if check_gold_path = true)
    Shat_val = Shat_test,
    check_gold_path = false,
    max_t = JudiLing.cal_max_timestep(estonian_test, :Word),
    max_can = 10,
    grams = 3,
    threshold = 0.01,
    tokenized = false,
    sep_token = "_",
    keep_sep = false,
    target_col = :Word,
    verbose = true,
);

And evaluate:

In [None]:
JudiLing.eval_acc(res_learn_test, cue_obj_test)

The production accuracy isn't too impressive. Let's see what happens if we turn on tolerance mode:

In [None]:
res_learn_test = JudiLing.learn_paths(
    estonian_train,
    estonian_test,
    cue_obj_train.C,
    S_test,
    F_train,
    Chat_test,
    cue_obj_train.A,
    cue_obj_train.i2f,
    cue_obj_train.f2i, # api changed in 0.3.1
    gold_ind = cue_obj_test.gold_ind, # gold paths of the test words (only used if check_gold_path = true)
    Shat_val = Shat_test,
    check_gold_path = false,
    max_t = JudiLing.cal_max_timestep(estonian_test, :Word),
    max_can = 10,
    grams = 3,
    threshold = 0.01,
    is_tolerant=true,
    max_tolerance=1,
    tolerance=-1.,
    target_col = :Word,
    verbose = true,
);
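
As a rough intuition for tolerance mode: n-grams whose support exceeds `threshold` are always candidates, while n-grams whose support falls between `tolerance` and `threshold` can still be admitted, up to `max_tolerance` of them per path. The sketch below only illustrates this two-level filter on made-up support values. The variable names and the exact handling of the boundaries are assumptions, and it is not the `learn_paths` algorithm itself.

```julia
# Made-up supports for candidate trigrams at one step of a path
supports = Dict("maj" => 0.40, "aja" => 0.008, "jas" => -0.50, "xyz" => -2.0)
threshold, tolerance = 0.01, -1.0

# Candidates that pass the regular threshold
strict = sort([c for (c, s) in supports if s > threshold])

# Candidates that fail the threshold but are still above the tolerance level
tolerated = sort([c for (c, s) in supports if tolerance < s <= threshold])
```

With these values only one trigram passes the strict threshold, and two more become available once weakly supported trigrams are tolerated.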

Evaluating again, this looks much better:

In [None]:
JudiLing.eval_acc(res_learn_test, cue_obj_test)

Finally, we want to know whether the production performance differs between principal parts (here the nominative, genitive and partitive singular) and non-principal parts. For this, we first turn the output of `learn_paths` into a dataframe...

In [None]:
prod_test = JudiLing.write2df(res_learn_test, estonian_test, cue_obj_train, cue_obj_test, target_col="Word")

...only keep the form with the highest support...

In [None]:
prod_test_preds = prod_test[(prod_test.isbest .== true),:]

Compute accuracy for the non-principal parts...

In [None]:
using Statistics

In [None]:
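# The mask is computed on estonian_test, so this assumes that prod_test_preds
# has exactly one row per test word, in the same order as estonian_test.
prod_test_preds_non_pp = prod_test_preds[Not((estonian_test.Number .== "sg") .&
                    ((estonian_test.Case .== "gen") .| (estonian_test.Case .== "nom") .| (estonian_test.Case .== "part"))),:]
mean(prod_test_preds_non_pp.iscorrect)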

...and for the principal parts

In [None]:
# Assumes prod_test_preds has one row per test word, in the order of estonian_test
prod_test_preds_pp = prod_test_preds[(estonian_test.Number .== "sg") .&
                    ((estonian_test.Case .== "gen") .| (estonian_test.Case .== "nom") .| (estonian_test.Case .== "part")),:]
mean(prod_test_preds_pp.iscorrect)
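
The principal-part mask is repeated in several cells, so it can help to see the pattern in isolation. The sketch below uses made-up vectors (`number`, `case`, `iscorrect` are hypothetical stand-ins for dataframe columns) and computes the mask once with `in.` before taking the mean accuracy per group.

```julia
using Statistics

# Made-up stand-ins for the dataframe columns
number = ["sg", "sg", "pl", "sg"]
case = ["nom", "ill", "gen", "part"]
iscorrect = [true, false, true, true]

# Principal parts: singular forms in the nominative, genitive or partitive
is_pp = (number .== "sg") .& in.(case, Ref(["nom", "gen", "part"]))

acc_pp = mean(iscorrect[is_pp])
acc_non_pp = mean(iscorrect[.!is_pp])
```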

Performance appears to be higher for principal parts, but the held-out data contain very few of them:

In [None]:
estonian_test[(estonian_test.Number .== "sg") .& 
                    ((estonian_test.Case .== "gen") .| (estonian_test.Case .== "nom") .| (estonian_test.Case .== "part")),:]

## Analysis 3: Training with and without principal parts

Load `StatsBase` and `Random` for random sampling:

In [None]:
import Pkg; Pkg.add("StatsBase")
using StatsBase, Random

Split the Estonian data into a test set that contains no principal parts and two training sets: one without any principal parts and one that includes them.

In [None]:
# all row indices of the dataframe
rows = collect(1:size(estonian,1))
# indices of all rows without principal parts
# (plain vector indexing: a trailing `,:` would return an N×1 matrix instead of a vector)
rows_non_pp = rows[Not((estonian.Number .== "sg") .& ((estonian.Case .== "gen") .| (estonian.Case .== "nom") .| (estonian.Case .== "part")))]
# sample 800 test rows from the rows without principal parts
Random.seed!(42)
rows_test = sample(rows_non_pp, 800, replace = false)
# select all rows without principal parts which are not in the test rows
# (setdiff on vectors keeps the original row order)
rows_train_none = setdiff(rows_non_pp, rows_test)

# subset a test set from the estonian dataframe
estonian_test_pp = estonian[rows_test,:]
# a training set with all rows which are not in the test set (this contains principal parts)
estonian_train_pp_all = estonian[Not(rows_test),:]
# a training set without any principal parts
estonian_train_pp_none = estonian[rows_train_none,:]
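
The logic of this split can be checked on a small example. The sketch below uses ten made-up row indices and a hypothetical mask, and it draws the test rows with `shuffle` from `Random` instead of `StatsBase.sample`, purely so that it runs without extra packages.

```julia
using Random

rows = collect(1:10)
is_pp = [r % 4 == 0 for r in rows]  # hypothetical mask: rows 4 and 8 are "principal parts"
rows_non_pp = rows[.!is_pp]

# Draw 3 test rows without replacement (shuffle used as a stand-in for sample)
rng = MersenneTwister(42)
rows_test = shuffle(rng, rows_non_pp)[1:3]

# Small training set: non-principal parts that are not in the test set
rows_train_none = setdiff(rows_non_pp, rows_test)
# Large training set: everything that is not in the test set
rows_train_all = setdiff(rows, rows_test)
```

The test rows never overlap with either training set, and the smaller training set is a subset of the larger one.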

### Exercise 8

First, create cue objects for the larger training data and the test data

In [None]:
cue_obj_train_pp_all, 
    cue_obj_test_pp = JudiLing.make_combined_cue_matrix(estonian_train_pp_all,
                                                        estonian_test_pp,
                                                        grams=3, target_col="Word");

### Exercise 9

Now create an additional cue object for the smaller training set. It is a subset of the larger training set, so we can reuse the `i2f` and `f2i` mappings created above. To do this, all we need to do is provide `cue_obj_train_pp_all` to the `make_cue_matrix` function.

In [None]:
cue_obj_train_pp_none = JudiLing.make_cue_matrix(estonian_train_pp_none,
                                                cue_obj_train_pp_all,
                                                grams=3, target_col="Word");
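
Reusing an existing mapping means that the new cue matrix has the same columns as the larger one, and columns for cues that do not occur in the subset simply stay empty. A minimal sketch with a hypothetical mapping `f2i` and a hypothetical helper `ngrams`:

```julia
# Hypothetical cue-to-column mapping, as if built from a larger training set
f2i = Dict("#ma" => 1, "maj" => 2, "aja" => 3, "ja#" => 4, "jad" => 5, "ad#" => 6)

# Hypothetical helper: trigrams with "#" as boundary marker
function ngrams(word; n=3)
    chars = collect("#" * word * "#")
    return [join(chars[i:i+n-1]) for i in 1:length(chars)-n+1]
end

# Cue matrix for a subset of the words, filled in via the existing mapping
subset_words = ["maja"]
C_sub = zeros(Int, length(subset_words), length(f2i))
for (row, w) in enumerate(subset_words), cue in ngrams(w)
    C_sub[row, f2i[cue]] = 1
end
C_sub
```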

### Exercise 10

Now subset the S matrix for the three datasets.

In [None]:
S_test_pp = S[rows_test,:]
S_train_pp_all = S[Not(rows_test),:]
S_train_pp_none = S[rows_train_none,:]

### Exercise 11

Compute mapping matrices.

In [None]:
F_train_pp_all = JudiLing.make_transform_matrix(cue_obj_train_pp_all.C, S_train_pp_all)
G_train_pp_all = JudiLing.make_transform_matrix(S_train_pp_all, cue_obj_train_pp_all.C)

In [None]:
F_train_pp_none = JudiLing.make_transform_matrix(cue_obj_train_pp_none.C, S_train_pp_none)
G_train_pp_none = JudiLing.make_transform_matrix(S_train_pp_none, cue_obj_train_pp_none.C)

### Exercise 12

Compute the predicted matrices for the test set.

In [None]:
Shat_test_pp_all = cue_obj_test_pp.C * F_train_pp_all
Chat_test_pp_all = S_test_pp * G_train_pp_all

In [None]:
Shat_test_pp_none = cue_obj_test_pp.C * F_train_pp_none
Chat_test_pp_none = S_test_pp * G_train_pp_none

### Exercise 13

Compute comprehension test accuracy based on the two training datasets:

In [None]:
JudiLing.eval_SC(Shat_test_pp_all, S_test_pp)

In [None]:
JudiLing.eval_SC(Shat_test_pp_none, S_test_pp)

### Exercise 14

Apply the production algorithm:

In [None]:
res_learn_test_pp_all = JudiLing.learn_paths(
    estonian_train_pp_all,
    estonian_test_pp,
    cue_obj_train_pp_all.C,
    S_test_pp,
    F_train_pp_all,
    Chat_test_pp_all,
    cue_obj_train_pp_all.A,
    cue_obj_train_pp_all.i2f,
    cue_obj_train_pp_all.f2i, # api changed in 0.3.1
    gold_ind = cue_obj_test_pp.gold_ind, # gold paths of the test words (only used if check_gold_path = true)
    Shat_val = Shat_test_pp_all,
    check_gold_path = false,
    max_t = JudiLing.cal_max_timestep(estonian_test_pp, :Word),
    max_can = 10,
    grams = 3,
    threshold = 0.01,
    is_tolerant=true,
    max_tolerance=1,
    tolerance=-1.,
    target_col = :Word,
    verbose = true,
);

In [None]:
JudiLing.eval_acc(res_learn_test_pp_all, cue_obj_test_pp)

In [None]:
res_learn_test_pp_none = JudiLing.learn_paths(
    estonian_train_pp_none,
    estonian_test_pp,
    cue_obj_train_pp_none.C,
    S_test_pp,
    F_train_pp_none,
    Chat_test_pp_none,
    cue_obj_train_pp_none.A,
    cue_obj_train_pp_none.i2f,
    cue_obj_train_pp_none.f2i, # api changed in 0.3.1
    gold_ind = cue_obj_test_pp.gold_ind, # gold paths of the test words (only used if check_gold_path = true)
    Shat_val = Shat_test_pp_none,
    check_gold_path = false,
    max_t = JudiLing.cal_max_timestep(estonian_test_pp, :Word),
    max_can = 10,
    grams = 3,
    threshold = 0.01,
    is_tolerant=true,
    max_tolerance=1,
    tolerance=-1.,
    target_col = :Word,
    verbose = true,
);

In [None]:
JudiLing.eval_acc(res_learn_test_pp_none, cue_obj_test_pp)