# Chapter 10: Cross-validation

## Preparations

Load necessary packages.

In [None]:
using DataFrames, JudiLing

## Working with randomly split training and testing data

To split the data randomly, we use the `loading_data_randomly_split` function. It takes as arguments the path of the file to load (here: `"../dat/dutch.csv"`), the directory where the resulting split should be saved (here: `"../dat/cv_random"`), a prefix for naming the resulting data files (`"dutch"`), and the number of word forms the validation set should include (`val_sample_size`). Finally, a random seed can be specified to make the output of the function reproducible.

In [None]:
data_train, data_val = JudiLing.loading_data_randomly_split(
     "../dat/dutch.csv", "../dat/cv_random", "dutch",
     val_sample_size = 300,
     random_seed = 42);
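To make the role of the seed concrete, here is a minimal sketch of a reproducible random split over row indices. The helper `toy_random_split` is hypothetical and is not how JudiLing implements `loading_data_randomly_split`; it only illustrates that fixing the seed fixes the split.

```julia
using Random

# Hypothetical helper (not part of JudiLing): split the row indices 1:n
# into training and validation indices in a reproducible way.
function toy_random_split(n::Int, val_sample_size::Int; random_seed::Int = 42)
    rng = MersenneTwister(random_seed)       # seeded generator
    perm = randperm(rng, n)                  # random order of all rows
    val_idx = sort(perm[1:val_sample_size])
    train_idx = sort(perm[val_sample_size+1:end])
    return train_idx, val_idx
end

toy_train_idx, toy_val_idx = toy_random_split(1000, 300)
```

Calling the helper again with the same seed returns exactly the same indices, while a different seed will generally give a different split.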

Now we can create cue objects for the training and test data. Given that there are novel cues present in the testing data, we use the `make_combined_cue_matrix` function.

In [None]:
cue_obj_train, cue_obj_val = JudiLing.make_combined_cue_matrix(data_train,
                                   data_val,
                                   grams=3,
                                   target_col="Ortho");
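The reason for a combined cue matrix can be illustrated with a toy example: if the cue inventory is built from both datasets, the training and validation matrices share the same columns, and a cue that only occurs in the validation data shows up as an all-zero column in the training matrix. The helpers `toy_trigrams` and `toy_cue_matrix` and the three word forms are made up for illustration; this is a sketch, not JudiLing's implementation.

```julia
# Hypothetical helper: letter trigrams of a word with "#" as boundary marker.
function toy_trigrams(word::AbstractString)
    chars = collect("#" * word * "#")
    return [String(chars[i:i+2]) for i in 1:length(chars)-2]
end

toy_train_words = ["kat", "kast"]
toy_val_words = ["kas"]

# Shared cue inventory built from BOTH datasets.
toy_cues = unique(reduce(vcat, toy_trigrams.(vcat(toy_train_words, toy_val_words))))
toy_f2i = Dict(c => i for (i, c) in enumerate(toy_cues))

# Hypothetical helper: binary word-by-cue matrix over the shared inventory.
function toy_cue_matrix(words, f2i)
    C = zeros(Int, length(words), length(f2i))
    for (r, w) in enumerate(words), c in toy_trigrams(w)
        C[r, f2i[c]] = 1
    end
    return C
end

C_train_toy = toy_cue_matrix(toy_train_words, toy_f2i)
C_val_toy = toy_cue_matrix(toy_val_words, toy_f2i)
```

Here the trigram `as#` occurs only in the validation word, so its column in `C_train_toy` contains only zeros.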

Similarly, for the semantic matrices we use the `make_combined_S_matrix` function, which handles novel concepts in the testing data.

In [None]:
S_train, S_val = JudiLing.make_combined_S_matrix(
                           data_train,
                           data_val,
                           ["Lexeme"],
                           ["Number", "WordCat"],
                           ncol=200);
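As a rough intuition for simulated semantic vectors, the sketch below assumes that each lexeme and each inflectional feature receives a random vector and that a word's semantic vector is the sum of the vectors of its lexeme and features. This additive scheme, the feature labels and the variable names are assumptions for illustration; the actual procedure in `make_combined_S_matrix` may differ in its details.

```julia
using Random

rng = MersenneTwister(42)
toy_ncol = 200

# One random vector per lexeme and per inflectional feature, drawn from a
# single shared inventory so that training and validation words are comparable.
toy_features = ["kat", "hond", "sg", "pl", "N"]
toy_vecs = Dict(f => randn(rng, toy_ncol) for f in toy_features)

# Each word is described by its lexeme and its inflectional features.
toy_words = [("kat", "sg", "N"), ("kat", "pl", "N"), ("hond", "pl", "N")]

# A word's semantic vector is the sum of its feature vectors (assumption).
S_toy = reduce(vcat, [permutedims(sum(toy_vecs[f] for f in w)) for w in toy_words])
```

Under this assumption, two forms of the same lexeme differ only by the vectors of the features in which they differ, here singular versus plural.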

For empirical vectors (e.g., word2vec), the semantic matrices for the training and testing data have to be constructed manually.

In [None]:
# reading in word list and word2vec vectors
S, words = JudiLing.load_S_matrix("../dat/dutch_w2v.csv"; header = false, sep = ",");

Get indices of word forms in training and testing data respectively.

In [None]:
# for homophones, let's just select the first matched word.
# As homophones are not distinguished by word2vec, we will get the same vector anyway.
idx_train = [findall(x->x==w, words)[1] for w in data_train.Ortho];
idx_test = [findall(x->x==w, words)[1] for w in data_val.Ortho];

Divide the S matrix into training and test data.

In [None]:
S_train_w2v = S[idx_train,:];
S_val_w2v = S[idx_test,:];
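The index lookup above throws a `BoundsError` if a word form has no entry in the vector file, because `findall` then returns an empty array. The following toy sketch (made-up words and vectors, hypothetical variable names) shows one way to detect such word forms before subsetting the matrix.

```julia
toy_vocab = ["kat", "hond", "boom"]          # words that have a vector
S_all_toy = [0.1 0.2; 0.3 0.4; 0.5 0.6]      # one row per word in toy_vocab
toy_targets = ["boom", "kat", "fiets"]       # word forms we need vectors for

# findfirst returns `nothing` when a word is not in the vocabulary.
toy_idx = [findfirst(==(w), toy_vocab) for w in toy_targets]

# Word forms without a vector; these would need to be removed from the data.
toy_missing = toy_targets[isnothing.(toy_idx)]

# Rows for the word forms that are covered, in the order of toy_targets.
S_sub_toy = S_all_toy[Int[i for i in toy_idx if !isnothing(i)], :]
```

If `toy_missing` is not empty, the corresponding rows would also have to be dropped from the dataframes so that the semantic and cue matrices stay aligned.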

(Fasttext vectors can be downloaded and directly split according to the supplied train and validation dataframes like this:

In [None]:
# uncomment to run

# data_train_small, data_val_small, S_ft_train, S_ft_val = JudiLing.load_S_matrix_from_fasttext(data_train,
#                                      data_val,
#                                      :nl;
#                                      target_col=:Ortho)

Note that if you opt to use fasttext vectors, you will have to regenerate the cue objects, since word forms for which no fasttext vectors exist are excluded from the datasets.)

### Comprehension
We now train `F` based on the training data only. Zeros in the `F` matrix correspond to novel cues that have not received any training.

In [None]:
F = JudiLing.make_transform_matrix(cue_obj_train.C, S_train)
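Why novel cues end up with zeros can be seen in a small least-squares example. The sketch below uses the pseudoinverse to estimate the mapping, which is an assumption made for illustration; `make_transform_matrix` may solve the same problem with a different numerical method. All matrices and names are toy examples.

```julia
using LinearAlgebra

# Toy form matrix: 2 training words x 4 cues. Column 4 belongs to a cue
# that only occurs in the validation data, so it is all zeros here.
C_train_toy = [1.0 1.0 0.0 0.0;
               1.0 0.0 1.0 0.0]

# Toy semantic matrix: 2 training words x 2 semantic dimensions.
S_train_toy = [1.0 0.0;
               0.0 1.0]

# Least-squares mapping via the pseudoinverse (assumption, see above).
F_toy = pinv(C_train_toy) * S_train_toy
```

The fourth row of `F_toy`, which belongs to the unseen cue, contains only zeros: the cue contributes nothing to any predicted semantic vector, while the training words are still mapped onto their semantic vectors.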

Next, we can evaluate the training and test accuracy. First, for evaluating training data, we get a predicted semantic matrix for the training data:

In [None]:
Shat_train = cue_obj_train.C * F

JudiLing.eval_SC(Shat_train, S_train)

For evaluating on the validation data, we use the `cue_obj_val`, but the same mapping matrix `F`:

In [None]:
Shat_val = cue_obj_val.C * F;

For evaluation, we want to not only compare the predicted semantic vectors with the vectors in the `S_val` matrix, but also with the semantic vectors of the training data. Therefore, we provide both:

In [None]:
JudiLing.eval_SC(Shat_val, S_val, S_train)

To count a word as correct not only when the target is the most correlated word, but also when it is among the top 5 most correlated words, we use `eval_SC_loose`. It takes the same parameters as `eval_SC`, and additionally the number `k` of top correlated neighbours to consider.

In [None]:
# top 5 accuracy
JudiLing.eval_SC_loose(Shat_val, S_val, S_train, 5)

In [None]:
# top 10 accuracy
JudiLing.eval_SC_loose(Shat_val, S_val, S_train, 10)
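The logic of strict versus top-`k` evaluation against the combined validation and training vectors can be sketched in a few lines. The function `toy_topk_accuracy` is a hypothetical, simplified stand-in that ignores homophones; it is not the implementation of `eval_SC` or `eval_SC_loose`.

```julia
using Statistics

# Hypothetical helper: proportion of predicted vectors whose own target
# (row i of S_val) is among the k most correlated candidate vectors.
function toy_topk_accuracy(Shat, S_val, S_train, k)
    candidates = vcat(S_val, S_train)          # validation rows come first
    R = cor(Shat, candidates; dims = 2)        # predictions x candidates
    hits = 0
    for i in 1:size(Shat, 1)
        top = partialsortperm(R[i, :], 1:k; rev = true)
        hits += i in top
    end
    return hits / size(Shat, 1)
end

S_val_toy = [1.0 0.0 0.0; 0.0 1.0 0.0]
S_train_toy2 = [0.0 0.0 1.0; 1.0 1.0 0.0]
Shat_toy = [0.9 0.1 0.0; 0.6 0.7 0.0]

toy_top1 = toy_topk_accuracy(Shat_toy, S_val_toy, S_train_toy2, 1)
toy_top2 = toy_topk_accuracy(Shat_toy, S_val_toy, S_train_toy2, 2)
```

In this toy case the second predicted vector correlates most strongly with a training vector and only second most strongly with its own target, so it counts as an error under strict evaluation but as correct with `k = 2`.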

To get a more precise account of the comprehension results:

In [None]:
acc_comp = JudiLing.accuracy_comprehension(
        S_val,
        S_train,
        Shat_val,
        data_val,
        data_train,
        target_col=:Ortho,
        base=[:Lexeme],
        inflections=[:WordCat, :Number]);

### Production

First, compute production mapping and predicted form matrix (see tutorial 3):

In [None]:
G = JudiLing.make_transform_matrix(S_train, cue_obj_train.C);
Chat_val = S_val * G;

Now we run the `learn_paths` function to actually produce forms (see tutorial 5). For evaluating training and testing data we need to use more parameters. Firstly, we need to compute the maximum number of trigrams any word in the training and test data includes:

In [None]:
# the number of steps
JudiLing.cal_max_timestep(data_train, data_val, "Ortho")

Next, we run the actual `learn_paths` function. Details on the individual parameters are provided below and in the help pages. We need to specify the maximum number of trigrams that we have calculated above in the `max_t` parameter. Similar to our earlier use of `learn_paths` we can again set a `threshold`. We also supply information about which `grams` size we used and the target column in the dataset.

In [None]:
prod_val = JudiLing.learn_paths(
        data_train,            # training dataset
        data_val,              # validation dataset
        cue_obj_train.C,       # form matrix for training data
        S_val,                 # targeted semantic matrix for validation data
        F,                     # comprehension mapping
        Chat_val,              # predicted form matrix for validation data
        cue_obj_val.A,         # adjacency matrix for validation data
        cue_obj_train.i2f,     # index-to-feature dictionary for training data 
        cue_obj_train.f2i,     # feature-to-index dictionary for training data
        max_t=14,
        threshold=0.001,
        grams=3,
        target_col="Ortho");

We can again evaluate the produced forms for the validation data using `eval_acc`:

In [None]:
JudiLing.eval_acc(prod_val, cue_obj_val)

We can also make use of the so-called tolerance mode, where the algorithm tolerates a specified number of trigrams with support below the threshold within each word form. For this, we set `is_tolerant=true`, `tolerance=-0.1` (the minimum support that tolerated trigrams below the general `threshold` still need to reach) and finally the maximum number of trigrams with lower support that may be tolerated (`max_tolerance=1`).

In [None]:
prod_val_tol = JudiLing.learn_paths(
        data_train,         
        data_val,          
        cue_obj_train.C,  
        S_val,            
        F,             
        Chat_val,        
        cue_obj_val.A,     
        cue_obj_train.i2f, 
        cue_obj_train.f2i,  
        max_t=14,
        threshold=0.001,
        grams=3,
        target_col="Ortho",
        is_tolerant = true,
        tolerance = -0.1,
        max_tolerance = 1);

In [None]:
JudiLing.eval_acc(prod_val_tol, cue_obj_val)
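The interplay of `threshold`, `tolerance` and `max_tolerance` can be illustrated with a toy breadth-first search over overlapping trigrams. This is a strongly simplified sketch under stated assumptions: it uses a single made-up support value per trigram and the hypothetical helpers `toy_build_paths` and `toy_path_to_word`, whereas `learn_paths` itself is considerably more elaborate.

```julia
# Hypothetical, simplified path search (not JudiLing's learn_paths).
function toy_build_paths(support::Dict{String,Float64}; threshold = 0.1,
                         tolerance = -0.1, max_tolerance = 0, max_t = 5)
    # Each path is (trigrams so far, number of tolerated trigrams).
    paths = [([c], 0) for (c, s) in support if startswith(c, "#") && s > threshold]
    finished = Vector{String}[]
    for _ in 2:max_t
        next_paths = Tuple{Vector{String},Int}[]
        for (cues, n_tol) in paths, (c, s) in support
            cues[end][2:3] == c[1:2] || continue   # trigrams must overlap
            if s > threshold
                cand = (vcat(cues, c), n_tol)
            elseif s > tolerance && n_tol < max_tolerance
                cand = (vcat(cues, c), n_tol + 1)  # use up tolerance budget
            else
                continue
            end
            endswith(c, "#") ? push!(finished, cand[1]) : push!(next_paths, cand)
        end
        paths = next_paths
    end
    return finished
end

# Hypothetical helper: glue overlapping trigrams back into a word form.
toy_path_to_word(cues) = cues[1] * join(last(c) for c in cues[2:end])

toy_support = Dict("#ka" => 0.9, "kat" => 0.8, "at#" => 0.6,
                   "kas" => 0.05, "as#" => 0.7)

toy_strict = toy_build_paths(toy_support)
toy_tolerant = toy_build_paths(toy_support; max_tolerance = 1)
```

Without tolerance only `#kat#` is produced, because `kas` falls below the threshold. With `max_tolerance = 1` the path through `kas` is kept as well, so `#kas#` becomes an additional candidate.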

## Working with carefully split training and testing data

When splitting training and testing data carefully, we ensure that the testing data only includes cues (i.e. trigrams), lexomes and inflectional features that have been seen during training. The setup of `loading_data_careful_split` is similar to that of `loading_data_randomly_split`, but we additionally need to specify the columns with lexome and inflectional features (in this case `["Lexeme", "Number", "WordCat"]`), the column with the target forms (`n_grams_target_col = "Ortho"`), and the `grams` size. Note that in the call below the prefix (`"dutch"`) precedes the output directory.

In [None]:
data_train, data_val = JudiLing.loading_data_careful_split(
     "../dat/dutch.csv", "dutch", "../dat/cv_careful",
     ["Lexeme", "Number", "WordCat"],
     n_grams_target_col = "Ortho",
     grams = 3,
     val_sample_size = 300,
     random_seed = 42);
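Whether a given split is "careful" with respect to the cues can be checked with a small helper. The functions `toy_trigrams` and `toy_is_careful` and the word forms are hypothetical; the sketch only checks trigrams, not lexomes or inflectional features, and says nothing about how `loading_data_careful_split` constructs the split.

```julia
# Hypothetical helper: letter trigrams of a word with "#" as boundary marker.
function toy_trigrams(word::AbstractString)
    chars = collect("#" * word * "#")
    return [String(chars[i:i+2]) for i in 1:length(chars)-2]
end

# Hypothetical helper: true if every trigram of every validation word
# also occurs in at least one training word.
function toy_is_careful(train_words, val_words)
    train_cues = Set(reduce(vcat, toy_trigrams.(train_words)))
    return all(c in train_cues for w in val_words for c in toy_trigrams(w))
end

toy_not_careful = toy_is_careful(["kat", "kast"], ["kas"])     # "as#" is unseen
toy_careful = toy_is_careful(["kast", "las", "kat"], ["kas"])  # all trigrams seen
```

In the first case the validation word contains the trigram `as#`, which no training word provides. In the second case `las` contributes it, so the validation word consists entirely of known cues.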

Given that we have made sure that no novel cues are present in the testing data, we can simply use the `make_cue_matrix` function.

In [None]:
cue_obj_train, cue_obj_val = JudiLing.make_cue_matrix(
                                   data_train,
                                   data_val,
                                   grams=3,
                                   target_col="Ortho");

Creating semantic matrices using simulated vectors:

In [None]:
S_train, S_val = JudiLing.make_combined_S_matrix(
                           data_train,
                           data_val,
                           ["Lexeme"],
                           ["Number", "WordCat"],
                           ncol=200);

### Comprehension

Compute comprehension mapping based on training data.

In [None]:
F = JudiLing.make_transform_matrix(cue_obj_train.C, S_train);

Evaluate on training...

In [None]:
Shat_train = cue_obj_train.C * F;
JudiLing.eval_SC(Shat_train, S_train)

...and testing data:

In [None]:
Shat_val = cue_obj_val.C * F;
JudiLing.eval_SC(Shat_val, S_val, S_train)

In [None]:
# top 5 accuracy
JudiLing.eval_SC_loose(Shat_val, S_val, S_train, 5)

In [None]:
# top 10 accuracy
JudiLing.eval_SC_loose(Shat_val, S_val, S_train, 10)

### Production

Compute the production mapping based on the training data and the predicted form matrix for the validation data.

In [None]:
G = JudiLing.make_transform_matrix(S_train, cue_obj_train.C);
Chat_val = S_val * G;

Run the `learn_paths` function as described above:

In [None]:
prod_val = JudiLing.learn_paths(
        data_train,            # training dataset
        data_val,              # validation dataset
        cue_obj_train.C,       # form matrix for training data
        S_val,                 # targeted semantic matrix for validation data
        F,                     # comprehension mapping
        Chat_val,              # predicted form matrix for validation data
        cue_obj_val.A,         # adjacency matrix for validation data
        cue_obj_train.i2f,     # index-to-feature dictionary for training data 
        cue_obj_train.f2i,     # feature-to-index dictionary for training data
        max_t=14,
        threshold=0.001,
        grams=3,
        target_col="Ortho");

In [None]:
JudiLing.eval_acc(prod_val, cue_obj_val)

Making use of tolerance mode:

In [None]:
prod_val_tol = JudiLing.learn_paths(
        data_train,         
        data_val,          
        cue_obj_train.C,  
        S_val,            
        F,             
        Chat_val,        
        cue_obj_val.A,     
        cue_obj_train.i2f, 
        cue_obj_train.f2i,  
        max_t=14,
        threshold=0.001,
        grams=3,
        target_col="Ortho",
        is_tolerant = true,
        tolerance = -0.1,
        max_tolerance = 1);

In [None]:
JudiLing.eval_acc(prod_val_tol, cue_obj_val)

## Exercises

### Exercise 1

Creating a random data split

In [None]:
data_train, data_val = JudiLing.loading_data_randomly_split(
     "../dat/latin.csv", "../dat/cv_random", "latin",
     val_sample_size = 50,
     random_seed = 42);

Creating cue matrices.

In [None]:
cue_obj_train, cue_obj_val = JudiLing.make_combined_cue_matrix(data_train,
                                   data_val,
                                   grams=3,
                                   target_col="Word");

Creating S matrices:

In [None]:
S_train, S_val = JudiLing.make_combined_S_matrix(
                           data_train,
                           data_val,
                           ["Lexeme"],
                           ["Person", "Number", "Tense", "Voice", "Mood"],
                           ncol=300);

### Exercise 2

Training a comprehension mapping using the training data:

In [None]:
F = JudiLing.make_transform_matrix(cue_obj_train.C, S_train)

Predicting semantic vectors for both the training and validation data

In [None]:
Shat_train = cue_obj_train.C * F
Shat_val = cue_obj_val.C * F

Evaluation without taking into account homographs:

In [None]:
JudiLing.eval_SC(Shat_train, S_train)

In [None]:
JudiLing.eval_SC(Shat_val, S_val, S_train)

Evaluation while taking into account homographs:

In [None]:
JudiLing.eval_SC(Shat_train, S_train, data_train, "Word")

In [None]:
JudiLing.eval_SC(Shat_val, S_val, S_train, data_val, data_train, "Word")

### Exercise 3

Producing word forms for the validation data

In [None]:
G = JudiLing.make_transform_matrix(S_train, cue_obj_train.C)

In [None]:
Chat_train = S_train * G
Chat_val = S_val * G

In [None]:
JudiLing.cal_max_timestep(data_train, data_val, "Word")

In [None]:
prod_val = JudiLing.learn_paths(
        data_train,            # training dataset
        data_val,              # validation dataset
        cue_obj_train.C,       # form matrix for training data
        S_val,                 # targeted semantic matrix for validation data
        F,                     # comprehension mapping
        Chat_val,              # predicted form matrix for validation data
        cue_obj_val.A,         # adjacency matrix for validation data
        cue_obj_train.i2f,     # index-to-feature dictionary for training data 
        cue_obj_train.f2i,     # feature-to-index dictionary for training data
        max_t=16,
        threshold=0.001,
        grams=3,
        target_col="Word",
        verbose=true);

In [None]:
JudiLing.eval_acc(prod_val, cue_obj_val)

### Exercise 4

Repeat the analysis above with a careful split.

In [None]:
data_train, data_val = JudiLing.loading_data_careful_split(
     "../dat/latin.csv", "latin", "../dat/cv_careful",
     ["Lexeme", "Person", "Number", "Tense", "Voice", "Mood"],
     n_grams_target_col = "Word",
     grams = 3,
     val_sample_size = 50,
     random_seed = 42);

In [None]:
cue_obj_train, cue_obj_val = JudiLing.make_combined_cue_matrix(data_train,
                                   data_val,
                                   grams=3,
                                   target_col="Word");

In [None]:
S_train, S_val = JudiLing.make_combined_S_matrix(
                           data_train,
                           data_val,
                           ["Lexeme"],
                           ["Person", "Number", "Tense", "Voice", "Mood"],
                           ncol=300);

In [None]:
F = JudiLing.make_transform_matrix(cue_obj_train.C, S_train)
Shat_train = cue_obj_train.C * F
Shat_val = cue_obj_val.C * F

In [None]:
JudiLing.eval_SC(Shat_train, S_train)

In [None]:
JudiLing.eval_SC(Shat_val, S_val, S_train)

In [None]:
G = JudiLing.make_transform_matrix(S_train, cue_obj_train.C)
Chat_train = S_train * G
Chat_val = S_val * G

In [None]:
JudiLing.cal_max_timestep(data_train, data_val, "Word")

In [None]:
prod_val = JudiLing.learn_paths(
        data_train,            # training dataset
        data_val,              # validation dataset
        cue_obj_train.C,       # form matrix for training data
        S_val,                 # targeted semantic matrix for validation data
        F,                     # comprehension mapping
        Chat_val,              # predicted form matrix for validation data
        cue_obj_val.A,         # adjacency matrix for validation data
        cue_obj_train.i2f,     # index-to-feature dictionary for training data 
        cue_obj_train.f2i,     # feature-to-index dictionary for training data
        max_t=16,
        threshold=0.001,
        grams=3,
        target_col="Word");

In [None]:
JudiLing.eval_acc(prod_val, cue_obj_val)

The comprehension mapping is so accurate that the random and the careful split barely differ, whereas production improves slightly with the careful split.