# Chapter 13: JudiLingMeasures

First, install JudiLingMeasures.

In [None]:
using Pkg
Pkg.add("JudiLingMeasures")

Load JudiLingMeasures together with JudiLing, DataFrames and CSV. Flux is loaded as well, since it is needed for the DDL models later in this chapter.

In [None]:
using Flux
using JudiLing, JudiLingMeasures, DataFrames, CSV

## Preparations

Load the DLP data from Keuleers et al. (2010). The data is available [here](https://osf.io/uw7t6/). If you haven't done so before, download the `dlp-items.txt` and `dlp-stimuli.txt` files and store them in the `dat` directory.

In [None]:
dlp = JudiLing.load_dataset("../dat/dlp-stimuli.txt", delim="\t");
# only keep words
dlp = dlp[dlp[:,"celex.frequency"] .!= "NA",:]
# only keep relevant columns
dlp = dlp[:,["spelling", "celex.frequency", "coltheart.N"]];

Load word embeddings.

In [None]:
S, words = JudiLing.load_S_matrix("../dat/dlp_w2v.csv", header=true, sep=",")

Only keep words from the DLP for which word embeddings are available.

In [None]:
# lowercase the embedding vocabulary once, rather than once per row
words_lower = Set(lowercase.(words))
dlp = filter(row -> lowercase(row.spelling) in words_lower, dlp)

Make sure the order of the word forms in the `dlp` dataset and the semantic matrix `S` are the same.

In [None]:
all(dlp.spelling .== words)
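If the check above does not return `true`, the rows of the semantic matrix have to be brought into the order of the dataset before continuing. The following is a minimal sketch of one way to do this on toy data. All `toy_*` names and values are made up for illustration, and exact string matches between the two word lists are assumed.

```julia
# Toy stand-ins for `words`, `S` and `dlp.spelling` (hypothetical data).
toy_words    = ["kat", "hond", "boom"]
toy_S        = [1.0 0.0; 0.0 1.0; 0.5 0.5]   # one row per entry in toy_words
toy_spelling = ["hond", "boom", "kat"]       # order of the dataset

# Map each word to its row in the embedding matrix.
toy_row_of = Dict(w => i for (i, w) in enumerate(toy_words))

# Row indices in dataset order; throws a KeyError if a word has no embedding.
toy_idx = [toy_row_of[w] for w in toy_spelling]

toy_S_aligned     = toy_S[toy_idx, :]
toy_words_aligned = toy_words[toy_idx]

all(toy_spelling .== toy_words_aligned)  # true
```

Indexing with `toy_idx` reorders the rows and, as a side effect, drops embeddings for words that do not occur in the dataset.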

Create the cue object.

In [None]:
cue_obj = JudiLing.make_cue_matrix(dlp,
                                   grams=3,
                                   target_col="spelling");
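To get a feeling for what the cue matrix contains, the sketch below builds a tiny binary trigram matrix by hand. It is not JudiLing's implementation: the helper `toy_ngrams`, the toy words and the use of `#` as the word boundary marker are assumptions made for illustration.

```julia
# Split a word into n-grams after padding it with a boundary marker.
# Assumption: '#' marks the word boundaries.
function toy_ngrams(word::AbstractString; n::Int = 3, boundary::Char = '#')
    chars = collect(boundary * word * boundary)
    return [String(chars[i:i+n-1]) for i in 1:(length(chars) - n + 1)]
end

toy_forms = ["kat", "kar"]
toy_cues_per_word = toy_ngrams.(toy_forms)
# [["#ka", "kat", "at#"], ["#ka", "kar", "ar#"]]

# The unique cues, in order of first appearance, define the columns.
toy_cues = unique(vcat(toy_cues_per_word...))

# Binary cue matrix: entry (i, j) is 1 if word i contains cue j.
toy_C = [Int(c in cw) for cw in toy_cues_per_word, c in toy_cues]
# 2×5 matrix: [1 1 1 0 0; 1 0 0 1 1]
```

Each row is a word form and each column a trigram, so the two words share only the column for `#ka`.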

Calculate the F and G matrices.

In [None]:
F = JudiLing.make_transform_matrix(cue_obj.C, S)
G = JudiLing.make_transform_matrix(S, cue_obj.C)
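Conceptually, `F` maps form onto meaning and `G` maps meaning onto form, each as a least-squares solution of a linear system. The sketch below illustrates this idea with the Moore-Penrose pseudoinverse on toy matrices. This is an assumption made for illustration: `make_transform_matrix` may use a different numerical method internally, and all `toy_*` values are made up.

```julia
using LinearAlgebra

# Toy cue matrix (3 words × 3 cues) and toy semantic matrix (3 words × 2 dimensions).
toy_C = [1.0 0.0 1.0;
         0.0 1.0 1.0;
         1.0 1.0 0.0]
toy_S = [0.2 0.8;
         0.9 0.1;
         0.5 0.5]

toy_F = pinv(toy_C) * toy_S   # form -> meaning, least-squares solution of C * F ≈ S
toy_G = pinv(toy_S) * toy_C   # meaning -> form, least-squares solution of S * G ≈ C

toy_Shat = toy_C * toy_F      # predicted semantic vectors
toy_Chat = toy_S * toy_G      # predicted cue supports

isapprox(toy_Shat, toy_S; atol = 1e-8)  # true: the toy C is square and invertible
isapprox(toy_Chat, toy_C; atol = 1e-8)  # false: two dimensions cannot recover three cues exactly
```

With real data neither mapping is exact, which is why the predicted matrices differ from the gold-standard ones.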

Calculate $\hat{S}$ (`Shat`) and $\hat{C}$ (`Chat`).

In [None]:
Shat = cue_obj.C * F
Chat = S * G

Use the `learn_paths_rpi` function for computing the produced word forms. The function returns three objects:
- `res_learn`: standard output of `learn_paths` containing the produced word forms
- `gpi_learn`: contains path supports for the targeted word forms
- `rpi_learn`: contains path supports for the predicted word forms

To improve production accuracy we set `threshold=0.005`. With such a low threshold, running the next cell takes a few minutes. It is possible to skip this step and only calculate measures which do not require path supports.

In [None]:
res_learn, gpi_learn, rpi_learn = JudiLing.learn_paths_rpi(
    dlp,
    dlp,
    cue_obj.C,
    S,
    F,
    Chat,
    cue_obj.A,
    cue_obj.i2f,
    cue_obj.f2i, # api changed in 0.3.1
    gold_ind = cue_obj.gold_ind,
    Shat_val = Shat,
    check_gold_path = true,
    max_t = JudiLing.cal_max_timestep(dlp, :spelling),
    max_can = 10,
    grams = 3,
    threshold = 0.005,
    tokenized = false,
    sep_token = "_",
    keep_sep = false,
    target_col = :spelling,
    verbose = true,
);
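Roughly speaking, the threshold decides which n-grams are supported strongly enough to be considered when building candidate forms, so a lower value admits more candidates and costs more time. The toy sketch below illustrates only this trade-off. The support values and the `toy_*` names are made up and are not part of the JudiLing API.

```julia
# Hypothetical supports for five trigrams, as one might find in a row of Chat.
toy_support = Dict("#ka" => 0.62, "kat" => 0.31, "at#" => 0.08,
                   "kar" => 0.007, "ar#" => 0.002)

# Keep the n-grams whose support exceeds the threshold (sorted for a stable order).
toy_admitted(th) = sort([cue for (cue, s) in toy_support if s > th])

toy_admitted(0.1)    # ["#ka", "kat"]
toy_admitted(0.005)  # ["#ka", "at#", "kar", "kat"]
```

Lowering the threshold from 0.1 to 0.005 doubles the number of admitted n-grams in this example, and every additional n-gram multiplies the number of paths that have to be explored.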

Compute the production accuracy.

In [None]:
JudiLing.eval_acc(res_learn, cue_obj)
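Assuming that production accuracy is the proportion of words whose best-supported candidate is identical to the targeted form, the number reported above can be pictured with the following toy sketch. The word lists are made up, and JudiLing's own bookkeeping may differ in detail.

```julia
toy_targets   = ["kat", "hond", "boom", "kar"]
toy_predicted = ["kat", "hond", "boon", "kar"]   # best candidate per word (hypothetical)

# Proportion of words for which the best candidate equals the target.
toy_accuracy = count(toy_predicted .== toy_targets) / length(toy_targets)  # 0.75
```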

## Computing measures

Compute all available measures (only works if you ran the cell with `learn_paths_rpi` above). Due to the size of the dataset we set `low_cost_measures_only=true`, so the function will only calculate computationally light-weight measures. If you can afford to wait for a while, or have a smaller dataset, you can also set this parameter to `false`.

In [None]:
all_measures = JudiLingMeasures.compute_all_measures_train(dlp, # the data of interest
                                                     cue_obj, # the cue_obj of the training data
                                                     Chat, # the Chat of the data of interest
                                                     S, # the S matrix of the data of interest
                                                     Shat, # the Shat matrix of the data of interest
                                                     F, # the F matrix
                                                     G, # the G matrix
                                                     res_learn_train=res_learn,
                                                     rpi_learn_train=rpi_learn,
                                                     gpi_learn_train=gpi_learn,
                                                     low_cost_measures_only=true); 

Alternatively, only measures without path supports can be calculated by simply not providing the outputs of `learn_paths_rpi` to the function:

In [None]:
all_measures_no_path_supports = JudiLingMeasures.compute_all_measures_train(dlp, # the data of interest
                                                     cue_obj, # the cue_obj of the training data
                                                     Chat, # the Chat of the data of interest
                                                     S, # the S matrix of the data of interest
                                                     Shat, # the Shat matrix of the data of interest
                                                     F, # the F matrix
                                                     G, # the G matrix
                                                     low_cost_measures_only=true); 

You can compare which measures have been calculated by the two calls:

In [None]:
names(all_measures)

In [None]:
names(all_measures_no_path_supports)
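Rather than comparing the two lists of column names by eye, the difference can be computed with `setdiff`. The sketch below uses placeholder column names, which are hypothetical and do not correspond to actual measure names. With the real data, the two vectors would be the outputs of the two `names` calls.

```julia
# Placeholder column names standing in for the outputs of the two `names` calls.
toy_with_paths    = ["spelling", "measure_a", "measure_b", "path_measure_x", "path_measure_y"]
toy_without_paths = ["spelling", "measure_a", "measure_b"]

# Columns that are only available when path supports were provided.
toy_only_with_paths = setdiff(toy_with_paths, toy_without_paths)
# ["path_measure_x", "path_measure_y"]

# Every measure computed without path supports should also be in the full set.
issubset(toy_without_paths, toy_with_paths)  # true
```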

In [None]:
first(all_measures, 10)

Save measures. If you didn't calculate the full set of measures, make sure you change `all_measures` to `all_measures_no_path_supports`. It does not matter for the following analysis which of the two datasets you work with.

In [None]:
CSV.write("../res/dlp_measures.csv", all_measures)

## Modelling behavioural data with the calculated measures

Please find the R code in the next notebook.

## Computing measures for DDL models

Train DDL comprehension and production models. For demonstration purposes we only train for one epoch here:

In [None]:
res_comp = JudiLing.get_and_train_model(cue_obj.C,
                                        S,
                                        "../res/dlp_comp.bson", 
                                        verbose=true,
                                        n_epochs=1);
model_prod = Chain(
            Dense(size(S, 2) => 1000, relu),   # activation function inside layer
            Dense(1000 => size(cue_obj.C, 2)),
            sigmoid) |> gpu    
res_prod = JudiLing.get_and_train_model(S,
                                        cue_obj.C,
                                        "../res/dlp_prod.bson", 
                                        model=model_prod,
                                        loss_func=Flux.binarycrossentropy,
                                        verbose=true,
                                        n_epochs=1);
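The production model ends in a sigmoid and is trained with binary cross-entropy because its target, the cue matrix, is binary: the sigmoid squeezes every output into the interval (0, 1), and the loss rewards outputs close to 1 for cues that are present and close to 0 for cues that are absent. The sketch below spells this out in plain Julia. The functions `toy_logistic` and `toy_bce` are hand-written for illustration and are not Flux's implementations, which may add safeguards for numerical stability.

```julia
# Logistic (sigmoid) function, mapping any real number into (0, 1).
toy_logistic(x) = 1 / (1 + exp(-x))

# Mean binary cross-entropy between predictions in (0, 1) and binary targets.
toy_bce(yhat, y) = -sum(y .* log.(yhat) .+ (1 .- y) .* log.(1 .- yhat)) / length(y)

toy_target = [1.0, 0.0, 1.0, 0.0]                   # one row of a binary cue matrix
toy_good   = toy_logistic.([4.0, -4.0, 3.0, -3.0])  # raw outputs with the right signs
toy_bad    = toy_logistic.([-4.0, 4.0, -3.0, 3.0])  # raw outputs with the wrong signs

toy_bce(toy_good, toy_target) < toy_bce(toy_bad, toy_target)  # true
```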

Predict the $\hat{\mathbf{S}}$ and $\hat{\mathbf{C}}$ matrices:

In [None]:
Shat = JudiLing.predict_from_deep_model(res_comp.model, cue_obj.C)
Chat = JudiLing.predict_from_deep_model(res_prod.model, S)

Compute measures:

In [None]:
all_measures = JudiLingMeasures.compute_all_measures_train(dlp, # the data of interest
                                                    cue_obj, # the cue_obj of the training data
                                                    Chat, # the Chat of the data of interest
                                                    S, # the S matrix of the data of interest
                                                    Shat, # the Shat matrix of the data of interest
                                                    low_cost_measures_only=true);

In [None]:
all_measures

## Exercises

Load the Latin dataset and set up the C and S matrices.

In [None]:
latin = JudiLing.load_dataset("../dat/latin.csv")

In [None]:
cue_obj = JudiLing.make_cue_matrix(latin, grams=3, target_col="Word");
S = JudiLing.make_S_matrix(
                           latin,
                           ["Lexeme"],
                           ["Person", "Number", "Tense", "Voice", "Mood"],
                           ncol=300);
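The idea behind simulated semantic vectors is that each lexeme and each inflectional feature value receives its own random vector, and the semantic vector of a word is the sum of the vectors of its lexeme and features. The sketch below illustrates this under simplifying assumptions: the labels are made up, only a few features are used, and `make_S_matrix` may differ in detail (for example in the distribution it samples from or by adding noise).

```julia
using Random

toy_rng  = MersenneTwister(1)
toy_ncol = 5   # dimensionality of the simulated vectors

# One random vector per lexeme and per feature value (hypothetical labels).
toy_labels = ["LEXEME_A", "p1", "p2", "sg"]
toy_vec_of = Dict(l => randn(toy_rng, toy_ncol) for l in toy_labels)

# A word's semantic vector is the sum of the vectors of its lexeme and features.
toy_s_p1sg = toy_vec_of["LEXEME_A"] + toy_vec_of["p1"] + toy_vec_of["sg"]
toy_s_p2sg = toy_vec_of["LEXEME_A"] + toy_vec_of["p2"] + toy_vec_of["sg"]

# The two forms differ only in the vectors of their person features.
toy_s_p1sg - toy_s_p2sg ≈ toy_vec_of["p1"] - toy_vec_of["p2"]  # true
```

Because shared features contribute the same vector to every word, forms of the same lexeme end up close to each other in the simulated semantic space.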

Compute F and G and predict semantic and form matrices.

In [None]:
F = JudiLing.make_transform_matrix(cue_obj.C, S)
G = JudiLing.make_transform_matrix(S, cue_obj.C)

In [None]:
Shat = cue_obj.C * F
Chat = S * G

Produce word forms:

In [None]:
res_learn, gpi_learn, rpi_learn = JudiLing.learn_paths_rpi(
    latin,
    latin,
    cue_obj.C,
    S,
    F,
    Chat,
    cue_obj.A,
    cue_obj.i2f,
    cue_obj.f2i, # api changed in 0.3.1
    gold_ind = cue_obj.gold_ind,
    Shat_val = Shat,
    check_gold_path = true,
    max_t = JudiLing.cal_max_timestep(latin, :Word),
    max_can = 10,
    grams = 3,
    target_col = :Word,
    verbose = true,
);

In [None]:
JudiLing.eval_acc(res_learn, cue_obj)

Calculate measures.

In [None]:
latin_measures = JudiLingMeasures.compute_all_measures_train(latin, # the data of interest
                                                     cue_obj, # the cue_obj of the training data
                                                     Chat, # the Chat of the data of interest
                                                     S, # the S matrix of the data of interest
                                                     Shat, # the Shat matrix of the data of interest
                                                     F, # the F matrix
                                                     G, # the G matrix
                                                     res_learn_train=res_learn,
                                                     rpi_learn_train=rpi_learn,
                                                     gpi_learn_train=gpi_learn,
                                                     low_cost_measures_only=false); 

## References

Keuleers, E., Diependaele, K., and Brysbaert, M. (2010). Practice effects in large-scale visual word recognition studies: A lexical decision study on 14,000 Dutch mono- and disyllabic words and nonwords. Frontiers in Psychology, 1:174.