# Chapter 13.1: Korean verbs

## Exercise 1

Load the usual packages:

In [None]:
using DataFrames, JudiLing

## Data preparation

### Exercise 2

Load the Korean dataset and inspect the first rows:

In [None]:
korean = JudiLing.load_dataset("../dat/korean.csv")
first(korean, 5)

Inspect the size of the loaded dataset:

In [None]:
size(korean)

## Model

### Exercise 3

Create a cue object, using bi-syllables:

In [None]:
cue_obj = JudiLing.make_cue_matrix(korean, 
                                    grams=2, 
                                    target_col=:Word, 
                                    tokenized=true,
                                    sep_token="_", 
                                    keep_sep=true);

In [None]:
JudiLing.display_matrix(korean, :Word, cue_obj, cue_obj.C, :C)
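
To make the notion of bi-syllable cues concrete, the sketch below builds them for a single tokenized form in plain Julia. It is an illustration only and not how JudiLing computes its cues internally: the helper `syllable_bigrams`, the boundary marker `#`, and the form `"a_b_c"` are all assumptions made for this example.

```julia
# Hypothetical helper (not part of JudiLing): split a tokenized form into
# syllables, pad it with a boundary marker on both sides, and join each pair
# of neighbouring units with the separator.
function syllable_bigrams(word::AbstractString; sep_token="_", boundary="#")
    units = String[boundary]
    append!(units, split(word, sep_token))
    push!(units, boundary)
    return [string(units[i], sep_token, units[i + 1]) for i in 1:length(units) - 1]
end

cues = syllable_bigrams("a_b_c")   # placeholder form, not taken from korean.csv
println(cues)
```

Each distinct bigram of this kind would correspond to one column of a form matrix, with a 1 in the rows of the words that contain it.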

### Exercise 4

Simulate semantic vectors:

In [None]:
S = JudiLing.make_S_matrix(korean,
                            [:Lexeme],
                            [:Honorifics, :Tense, :SpeechLevel, :IllocutionaryForce],
                            ncol=size(cue_obj.C, 2));

In [None]:
JudiLing.display_matrix(korean, :Word, cue_obj, S, :S)
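
The idea behind simulated semantic vectors can be sketched without JudiLing: every lexeme and every inflectional feature value receives a random vector, and the vector of a word is the sum of the vectors of its features. The labels, the dimension of 8, and the absence of added noise are assumptions of this sketch; `make_S_matrix` may differ in its details.

```julia
using Random

Random.seed!(1)          # fixed seed so the sketch is reproducible
ncol = 8                 # vector length; the exercise uses size(cue_obj.C, 2)

# One random vector per label. The labels are placeholders, not values
# taken from korean.csv.
labels = ["LEXEME_A", "LEXEME_B", "past", "present"]
vectors = Dict(label => randn(ncol) for label in labels)

# Each word is described by its lexeme and one inflectional feature.
words = [["LEXEME_A", "past"], ["LEXEME_A", "present"], ["LEXEME_B", "past"]]

# The semantic vector of a word is the sum of the vectors of its features;
# stacking them row-wise gives a words-by-dimensions matrix.
S_sim = permutedims(hcat([sum(vectors[f] for f in feats) for feats in words]...))

println(size(S_sim))
```

Words that share a lexeme or a feature value share a summand, which makes their rows in the matrix more similar to each other.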

### Exercises 5 + 6

Compute the mapping matrix F from form to meaning, as well as the predicted semantic matrix:

In [None]:
F = JudiLing.make_transform_matrix(cue_obj.C, S)
Shat = cue_obj.C * F
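
Conceptually, F is a least-squares solution of C F ≈ S. The sketch below shows this with small dense toy matrices and the pseudoinverse from the standard library. The toy values are assumptions, and the sketch does not claim to reproduce how `make_transform_matrix` obtains its solution.

```julia
using LinearAlgebra

# Toy form matrix: 3 words (rows) by 4 cues (columns).
C_toy = [1.0 1.0 0.0 0.0;
         0.0 1.0 1.0 0.0;
         0.0 0.0 1.0 1.0]

# Toy semantic matrix: 3 words (rows) by 2 semantic dimensions.
S_small = [0.5 -1.0;
           1.5  0.2;
          -0.3  0.8]

F_toy = pinv(C_toy) * S_small   # least-squares solution of C_toy * F ≈ S_small
Shat_toy = C_toy * F_toy        # predicted semantic vectors

# Largest absolute deviation between predicted and target vectors.
println(maximum(abs.(Shat_toy - S_small)))
```

Because the rows of this toy form matrix are linearly independent, the predicted vectors match the targets up to floating-point error. With real data the fit is usually only approximate.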

### Exercise 7

Evaluate comprehension accuracy using lenient evaluation:

In [None]:
JudiLing.eval_SC(Shat, S, korean, :Word)
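
The logic of a correlation-based evaluation can be sketched as follows: a word counts as correctly understood when its predicted vector correlates more strongly with its own target vector than with any other. The helper `toy_accuracy` and the toy matrices are hypothetical. The sketch implements only this strict criterion and makes no special provision for homophones.

```julia
using Statistics

# Hypothetical helper (not JudiLing's eval_SC): fraction of rows of Shat
# whose highest correlation is with the same row of S.
function toy_accuracy(Shat::AbstractMatrix, S::AbstractMatrix)
    R = cor(Shat, S; dims=2)                 # R[i, j] = cor(Shat[i, :], S[j, :])
    hits = [argmax(R[i, :]) == i for i in 1:size(R, 1)]
    return sum(hits) / length(hits)
end

# Toy target vectors (rows) and slightly perturbed predictions.
S_eval = [1.0 0.0 2.0;
          0.0 3.0 1.0;
          2.0 1.0 0.0]
Shat_eval = S_eval .+ [0.1 -0.1  0.0;
                       0.0  0.1 -0.1;
                      -0.1  0.0  0.1]

println(toy_accuracy(Shat_eval, S_eval))
```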

### Exercise 8

Compute the production mapping G and the predicted form matrix:

In [None]:
G = JudiLing.make_transform_matrix(S, cue_obj.C)
Chat = S * G

### Exercise 9

Run the learn_paths algorithm:

In [None]:
res = JudiLing.learn_paths(korean, cue_obj, S, F, Chat, Shat_val=Shat)

### Exercise 10

Evaluate production accuracy:

In [None]:
JudiLing.eval_acc(res, cue_obj)

### Exercise 11

We now repeat the analysis, this time also testing the model on held-out data.

First, split the data carefully:

In [None]:
data_train, data_val = JudiLing.loading_data_careful_split(
    "../dat/korean.csv", "korean", "../dat/careful",
    [:Lexeme, :Honorifics, :Tense, :SpeechLevel, :IllocutionaryForce],
    n_grams_target_col="Word",
    grams=2,
    val_sample_size=300,
    random_seed=42,
    n_grams_tokenized=true,
    n_grams_sep_token="_",
    n_grams_keep_sep=true)
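
A careful split matters because a linear mapping has no weights for cues it never saw during training. The sketch below checks, for toy data, whether every bi-syllable of a held-out form also occurs in the training forms. The helper `bigram_set`, the boundary marker `#`, and the forms are assumptions for illustration. The sketch does not show how `loading_data_careful_split` selects its samples.

```julia
# Hypothetical helper (not part of JudiLing): the set of syllable bigrams
# of a tokenized form, padded with a boundary marker.
function bigram_set(word::AbstractString; sep_token="_", boundary="#")
    units = String[boundary]
    append!(units, split(word, sep_token))
    push!(units, boundary)
    return Set(string(units[i], sep_token, units[i + 1]) for i in 1:length(units) - 1)
end

# Placeholder forms, not taken from korean.csv.
train_words = ["a_b", "b_c", "c_a_b"]
val_words = ["a_b_c", "c_c"]

# All cues that occur anywhere in the training forms.
train_cues = union((bigram_set(w) for w in train_words)...)

# A held-out form is covered when all of its cues occur in the training data.
covered = [issubset(bigram_set(w), train_cues) for w in val_words]
println(covered)
```

In this toy example the first held-out form is covered, while the second contains the bigram `c_c`, which no training form provides.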

Create cue and semantic matrices:

In [None]:
cue_obj_train, cue_obj_val = JudiLing.make_combined_cue_matrix(data_train, 
                                    data_val,
                                    grams=2, 
                                    target_col=:Word, 
                                    tokenized=true,
                                    sep_token="_", 
                                    keep_sep=true);

In [None]:
S_train, S_val = JudiLing.make_S_matrix(data_train,
                                        data_val,
                                        [:Lexeme],
                                        [:Honorifics, :Tense, :SpeechLevel, :IllocutionaryForce],
                                        ncol=size(cue_obj_train.C, 2));

Train F and G matrices on training data:

In [None]:
F_train = JudiLing.make_transform_matrix(cue_obj_train.C, S_train)
G_train = JudiLing.make_transform_matrix(S_train, cue_obj_train.C)

Predict the semantic and form matrices for the validation data:

In [None]:
Shat_val = cue_obj_val.C * F_train
Chat_val = S_val * G_train

Evaluate comprehension:

In [None]:
JudiLing.eval_SC(Shat_val, S_val, S_train, data_val, data_train, "Word")

Run learn_paths on the validation data:

In [None]:
prod_val = JudiLing.learn_paths(
        data_train,            # training dataset
        data_val,              # validation dataset
        cue_obj_train.C,       # form matrix for training data
        S_val,                 # target semantic matrix for validation data
        F_train,               # comprehension mapping
        Chat_val,              # predicted form matrix for validation data
        cue_obj_val.A,         # adjacency matrix for validation data
        cue_obj_train.i2f,     # index-to-feature dictionary for training data 
        cue_obj_train.f2i,     # feature-to-index dictionary for training data
        max_t=JudiLing.cal_max_timestep(data_train, data_val, "Word",
                                        tokenized=true, sep_token="_"),  # count syllables, not characters
        threshold=0.001,
        grams=2,
        target_col="Word",
        tokenized=true,
        sep_token="_", 
        keep_sep=true)

Evaluate production accuracy:

In [None]:
JudiLing.eval_acc(prod_val, cue_obj_val)

### Exercise 12

We now use the Hangul spelling instead of the pronunciation.

First, split the data:

In [None]:
data_train, data_val = JudiLing.loading_data_careful_split(
    "../dat/korean.csv", "korean_han", "../dat/careful",
    [:Lexeme, :Honorifics, :Tense, :SpeechLevel, :IllocutionaryForce],
    n_grams_target_col="Hangul",
    grams=2,
    val_sample_size=300,
    random_seed=42)

Create cue and semantic matrices:

In [None]:
cue_obj_train, cue_obj_val = JudiLing.make_combined_cue_matrix(data_train, 
                                    data_val,
                                    grams=2, 
                                    target_col=:Hangul);

In [None]:
S_train, S_val = JudiLing.make_S_matrix(data_train,
                                        data_val,
                                        [:Lexeme],
                                        [:Honorifics, :Tense, :SpeechLevel, :IllocutionaryForce],
                                        ncol=size(cue_obj_train.C, 2));

Train F and G matrices and predict semantic and cue matrices for the validation data:

In [None]:
F_train = JudiLing.make_transform_matrix(cue_obj_train.C, S_train)
G_train = JudiLing.make_transform_matrix(S_train, cue_obj_train.C)

In [None]:
Shat_val = cue_obj_val.C * F_train
Chat_val = S_val * G_train

Evaluate comprehension:

In [None]:
JudiLing.eval_SC(Shat_val, S_val, S_train, data_val, data_train, "Hangul")

Run learn_paths on the validation data:

In [None]:
prod_val = JudiLing.learn_paths(
        data_train,            # training dataset
        data_val,              # validation dataset
        cue_obj_train.C,       # form matrix for training data
        S_val,                 # target semantic matrix for validation data
        F_train,               # comprehension mapping
        Chat_val,              # predicted form matrix for validation data
        cue_obj_val.A,         # adjacency matrix for validation data
        cue_obj_train.i2f,     # index-to-feature dictionary for training data 
        cue_obj_train.f2i,     # feature-to-index dictionary for training data
        max_t=JudiLing.cal_max_timestep(data_train, data_val, "Hangul"),
        threshold=0.001,
        grams=2,
        target_col="Hangul")

Evaluate production accuracy:

In [None]:
JudiLing.eval_acc(prod_val, cue_obj_val)