# Chapter 12: Deep discriminative learning in JudiLing

To use deep learning in JudiLing, we first have to install Julia's deep learning library `Flux`:

In [None]:
using Pkg
Pkg.add("Flux")

(Note: if you want to accelerate training by utilising a GPU, follow the instructions [here](https://fluxml.ai/Flux.jl/stable/gpu/).)

Now we can make both Flux and JudiLing available in our session. It is important to load Flux first and JudiLing second, so that JudiLing detects that Flux is available and exposes its deep learning functionality.

In [None]:
using Flux
using JudiLing
using DataFrames, CSV, Plots

# Preparation

First, we do a careful split along Lexeme, Number and WordCat. We model using triphones, so we also make sure that each triphone in the validation data has occurred in the training data. We hold out 300 data points as validation data.

In [None]:
data_train, data_val = JudiLing.loading_data_careful_split(
"../dat/dutch.csv", "dutch", joinpath(@__DIR__, "..", "dat", "careful"),
["Lexeme", "Number", "WordCat"],
n_grams_target_col = "Word",
grams = 3,
val_sample_size = 300,
random_seed = 42)
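
The constraint that every triphone in the validation data must also occur in the training data can be illustrated with a minimal sketch. The helpers `trigrams` and `unseen_trigrams` below are hypothetical and not part of JudiLing; the sketch uses orthographic toy words and assumes `#` as the word boundary marker.

```julia
# Hypothetical helper: split a word into trigrams, padded with a boundary marker.
function trigrams(word::AbstractString; boundary::Char='#')
    chars = collect(string(boundary, word, boundary))
    return [String(chars[i:i+2]) for i in 1:length(chars)-2]
end

# Hypothetical helper: trigrams that occur in the validation words
# but in none of the training words.
function unseen_trigrams(train_words, val_words)
    seen = Set{String}()
    for w in train_words
        union!(seen, trigrams(w))
    end
    unseen = Set{String}()
    for w in val_words, t in trigrams(w)
        t in seen || push!(unseen, t)
    end
    return unseen
end

# "honden" needs "nde" and "den", which the first training set lacks.
missing_before = unseen_trigrams(["kat", "katten", "hond"], ["honden"])
# Adding "handen" to the training words covers them.
missing_after = unseen_trigrams(["kat", "katten", "hond", "handen"], ["honden"])
```

A split is "careful" in this sense when the second situation holds for every held-out word, so that the model never has to deal with a cue it has not seen during training.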

Load embeddings from fastText.

In [None]:
data_train, data_val, S_train, S_val = JudiLing.load_S_matrix_from_fasttext(data_train, data_val, :nl, 
                                                                        target_col=:Ortho)

In [None]:
size(data_train)

In [None]:
size(data_val)

Get cue objects for training and validation data.

In [None]:
cue_obj_train, cue_obj_val = JudiLing.make_combined_cue_matrix(data_train, data_val, grams=3,
                                                                target_col="Word")

# Comprehension

## Baseline

Train comprehension mapping.

In [None]:
@time F = JudiLing.make_transform_matrix(cue_obj_train.C, S_train)

Evaluate training and validation accuracy.

In [None]:
Shat_train = cue_obj_train.C * F
JudiLing.eval_SC(Shat_train, S_train, data_train, :Word)

In [None]:
Shat_val = cue_obj_val.C * F
JudiLing.eval_SC(Shat_val, S_val, S_train, data_val, data_train, :Word)
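
Conceptually, the linear baseline solves $\mathbf{C}\mathbf{F} \approx \mathbf{S}$ in the least-squares sense, and a word counts as correctly understood if its predicted semantic vector correlates most strongly with its own target vector. The following sketch reproduces this idea on toy data using only Julia's standard library. It is an illustration under simplifying assumptions, not JudiLing's actual implementation, which may differ, for example in how the mapping is estimated and how homophones are handled.

```julia
using LinearAlgebra, Statistics

# Toy data (hypothetical): 4 words, 5 cues, 3 semantic dimensions.
C = [1.0 1 0 0 0;
     0 1 1 0 0;
     0 0 1 1 0;
     0 0 0 1 1]
S = [ 1.0 0.0  2.0;
      0.0 1.0 -1.0;
      2.0 2.0  0.0;
     -1.0 0.5  0.3]

F = pinv(C) * S          # least-squares solution of C * F ≈ S
Shat = C * F             # predicted semantic vectors

# Correlate every predicted row with every target row (words × words).
R = cor(Shat, S, dims=2)

# A word counts as correct if its own target is the best-correlated one.
acc = mean(argmax(R[i, :]) == i for i in 1:size(R, 1))
```

Because the toy cue matrix has full row rank, the mapping reproduces the targets exactly and the accuracy is 1.0; real data will generally fall short of this.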

## Deep default model, training data only

Train the model, taking only the training data into account. The `get_and_train_model` function receives the input data (`cue_obj_train.C`) and the target data (`S_train`). The model is saved to `"../res/comp_train_only.bson"`, and the function returns the per-epoch losses for the training data and, in principle, also for the validation data. Since no validation data has been supplied here, the validation losses will be empty.

In [None]:
@time res_train_only = JudiLing.get_and_train_model(
                                                    cue_obj_train.C,
                                                    S_train,
                                                    "../res/comp_train_only.bson", 
                                                    verbose=true)

Plot training loss.

In [None]:
plot(res_train_only.losses_train, xlab="epoch", ylab="loss", label = "training", size=(400, 200), 
    linecolor="black")

In [None]:
savefig("../fig/deep_training_only_bw.pdf")

Predict $\hat{\mathbf{S}}$ matrix for the training data.

In [None]:
Shat_train = JudiLing.predict_from_deep_model(res_train_only.model, cue_obj_train.C);

Calculate comprehension accuracy.

In [None]:
JudiLing.eval_SC(Shat_train, S_train, data_train, :Word)

Predict semantic matrix for validation data.

In [None]:
Shat_val = JudiLing.predict_from_deep_model(res_train_only.model, cue_obj_val.C);

Compute comprehension accuracy.

In [None]:
JudiLing.eval_SC(Shat_val, S_val, S_train, data_val, data_train, :Word)

## Deep default model, training and validation data

Now, we also supply validation data, i.e. `cue_obj_val.C` and `S_val`. To compute validation accuracy after each epoch, the function also requires `data_train`, `data_val` as well as the target column `:Word`. The model is saved to `"../res/comp_full.bson"`, and losses are returned, this time also for the validation data.

In [None]:
@time res_full = JudiLing.get_and_train_model(
                                            cue_obj_train.C,
                                            S_train,
                                            cue_obj_val.C,
                                            S_val,
                                            data_train,
                                            data_val,
                                            :Word,
                                            "../res/comp_full.bson",
                                            verbose=true)

Plot losses.

In [None]:
plot(res_full.losses_train, xlab="epoch", ylab="loss", label = "training", size=(400,200), legend=:right,
linecolor="black")
plot!(res_full.losses_val, label = "validation",
linecolor="black", linestyle=:dash)

In [None]:
savefig("../fig/deep_full_loss_bw.pdf")

Plot accuracy on the validation data.

In [None]:
plot(res_full.accs_val, xlab="epoch", ylab="accuracy", label="validation", size=(400,200),
linecolor="black", linestyle=:dash)

In [None]:
savefig("../fig/deep_full_acc_bw.pdf")

Training accuracy:

In [None]:
Shat_train = JudiLing.predict_from_deep_model(res_full.model, cue_obj_train.C)
JudiLing.eval_SC(Shat_train, S_train, data_train, :Word)

Validation accuracy:

In [None]:
Shat_val = JudiLing.predict_from_deep_model(res_full.model, cue_obj_val.C)
JudiLing.eval_SC(Shat_val, S_val, S_train, data_val, data_train, :Word)

## Deep default model, optimising for accuracy

This time, we keep the same model as before, but now we retain the model with the best validation **accuracy** rather than with the lowest loss. To achieve this, we set `optimise_for_acc=true`. We also stop training if the validation accuracy has not improved after `early_stopping=20` epochs.

In [None]:
@time res_acc = JudiLing.get_and_train_model(cue_obj_train.C,
                                            S_train,
                                            cue_obj_val.C,
                                            S_val,
                                            data_train,
                                            data_val,
                                            :Word,
                                            "../res/comp_acc.bson",
                                            verbose=true,
                                            n_epochs=100,
                                            early_stopping=20,
                                            optimise_for_acc=true)
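
The interplay between keeping the best model and stopping early can be sketched as follows. The function `simulate_early_stopping` is a hypothetical helper that works on a given sequence of per-epoch scores; it illustrates the general patience-based logic and is not taken from JudiLing's source, whose exact stopping rule may differ in detail.

```julia
# Hypothetical helper: walk through per-epoch scores, remember the best epoch,
# and stop once `patience` epochs have passed without improvement.
function simulate_early_stopping(scores; patience::Int, higher_is_better::Bool=true)
    best_epoch = 0
    best_score = higher_is_better ? -Inf : Inf
    stopped_at = length(scores)
    for (epoch, score) in enumerate(scores)
        improved = higher_is_better ? score > best_score : score < best_score
        if improved
            best_score = score
            best_epoch = epoch
        elseif epoch - best_epoch >= patience
            stopped_at = epoch
            break
        end
    end
    return (best_epoch=best_epoch, best_score=best_score, stopped_at=stopped_at)
end

# Toy validation accuracies (hypothetical values).
accs = [0.10, 0.30, 0.45, 0.44, 0.43, 0.46, 0.40, 0.39, 0.38, 0.50]
res_toy = simulate_early_stopping(accs; patience=3)
```

In this toy run the best accuracy is reached in epoch 6 and training stops in epoch 9, so the later value of 0.50 is never seen. This is the usual trade-off of early stopping: a smaller patience saves time but can miss late improvements. Setting `higher_is_better=false` gives the corresponding behaviour for a loss.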

Plot losses.

In [None]:
plot(res_acc.losses_train, xlab="epoch", ylab="loss", label = "training", size=(400,200),
linecolor="black")
plot!(res_acc.losses_val, label = "validation", linecolor="black", linestyle=:dash)

In [None]:
savefig("../fig/deep_full_acc_loss_bw.pdf")

Plot validation accuracies.

In [None]:
plot(res_acc.accs_val, xlab="epoch", ylab="accuracy", label="validation", size=(400,200),
    linecolor="black", linestyle=:dash)

In [None]:
savefig("../fig/deep_full_acc_acc_bw.pdf")

Training accuracy.

In [None]:
Shat_train = JudiLing.predict_from_deep_model(res_acc.model, cue_obj_train.C)
JudiLing.eval_SC(Shat_train, S_train, data_train, :Word)

Validation accuracy.

In [None]:
Shat_val = JudiLing.predict_from_deep_model(res_acc.model, cue_obj_val.C);
JudiLing.eval_SC(Shat_val, S_val, S_train, data_val, data_train, :Word)

## Deep default model, optimising for accuracy, lower learning rate

Next, we keep the same setup as for the previous model, but experiment with the learning rate. For this, we first have to define an optimizer with the new learning rate:

In [None]:
optimizer_llr = Flux.Adam(0.0001)

Then we supply this optimizer to the `get_and_train_model` function:

In [None]:
@time res_acc_llr = JudiLing.get_and_train_model(
                                                cue_obj_train.C,
                                                S_train,
                                                cue_obj_val.C,
                                                S_val,
                                                data_train,
                                                data_val,
                                                :Word,
                                                "../res/comp_acc_llr.bson",
                                                verbose=true,
                                                n_epochs=100,
                                                early_stopping=20,
                                                optimise_for_acc=true,
                                                optimizer=optimizer_llr)

Plot losses.

In [None]:
plot(res_acc_llr.losses_train, xlab="epoch", ylab="loss", label = "training", size=(400,200),
linecolor="black")
plot!(res_acc_llr.losses_val, label = "validation", linecolor="black", linestyle=:dash)

In [None]:
savefig("../fig/deep_full_llr_loss_bw.pdf")

Plot validation accuracies.

In [None]:
plot(res_acc_llr.accs_val, xlab="epoch", ylab="accuracy", label="validation", size=(400,200), linecolor="black", linestyle=:dash)

In [None]:
savefig("../fig/deep_full_llr_acc_bw.pdf")

Training accuracy:

In [None]:
Shat_train = JudiLing.predict_from_deep_model(res_acc_llr.model, cue_obj_train.C)
JudiLing.eval_SC(Shat_train, S_train, data_train, :Word)

Validation accuracy.

In [None]:
Shat_val = JudiLing.predict_from_deep_model(res_acc_llr.model, cue_obj_val.C);
JudiLing.eval_SC(Shat_val, S_val, S_train, data_val, data_train, :Word)

## Deeper model

Finally, we experiment with using a deeper model. We have to define this model first in the following way:

In [None]:
model_deeper= Chain(Dense(size(cue_obj_train.C, 2) => 1000, relu),
            Dense(1000 => 1000, relu),
            Dense(1000 => size(S_train, 2)))

This model contains two hidden layers with a dimensionality of 1000 each, and each layer (except the last) is followed by a ReLU activation function.
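
To get a feeling for how much larger such a model is, the number of trainable parameters of a stack of dense layers can be computed directly from the layer widths. The helper `dense_param_count` and the input and output sizes below are assumptions made for illustration only; the actual sizes depend on `size(cue_obj_train.C, 2)` and `size(S_train, 2)`.

```julia
# Hypothetical helper: number of trainable parameters in a stack of dense layers,
# given the layer widths from input to output (weights plus one bias per output unit).
dense_param_count(dims) = sum((dims[i] + 1) * dims[i + 1] for i in 1:length(dims) - 1)

n_cues = 2000   # assumed number of trigram cues, for illustration only
n_sem = 300     # assumed dimensionality of the semantic vectors

params_one_hidden = dense_param_count([n_cues, 1000, n_sem])         # a single hidden layer
params_two_hidden = dense_param_count([n_cues, 1000, 1000, n_sem])   # two hidden layers, as in model_deeper
```

Under these assumed sizes, the second hidden layer adds roughly one million parameters, which is worth keeping in mind when comparing training times and the risk of overfitting.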

We supply this model to the `model` parameter of `get_and_train_model`:

In [None]:
@time res_deeper = JudiLing.get_and_train_model(
                                                                cue_obj_train.C,
                                                                S_train,
                                                                cue_obj_val.C,
                                                                S_val,
                                                                data_train,
                                                                data_val,
                                                                :Word,
                                                                "../res/comp_deeper.bson",
                                                                return_losses=true, 
                                                                verbose=true,
                                                                model=model_deeper,
                                                                optimise_for_acc=true,
                                                                early_stopping=20)

Plot losses.

In [None]:
plot(res_deeper.losses_train, xlab="epoch", ylab="loss", label = "training", size=(400,200),
linecolor="black")
plot!(res_deeper.losses_val, label = "validation", linecolor="black", linestyle=:dash)

In [None]:
savefig("../fig/deep_full_deeper_loss_bw.pdf")

Plot validation accuracies.

In [None]:
plot(res_deeper.accs_val, xlab="epoch", ylab="accuracy", label="validation", size=(400,200), linecolor="black", linestyle=:dash)

In [None]:
savefig("../fig/deep_full_deeper_acc_bw.pdf")

Training accuracy.

In [None]:
Shat_train = JudiLing.predict_from_deep_model(res_deeper.model, cue_obj_train.C)
JudiLing.eval_SC(Shat_train, S_train, data_train, :Word)

Validation accuracy.

In [None]:
Shat_val = JudiLing.predict_from_deep_model(res_deeper.model, cue_obj_val.C)
JudiLing.eval_SC(Shat_val, S_val, S_train, data_val, data_train, :Word)

# Production

## Baseline

Train production mapping.

In [None]:
@time G = JudiLing.make_transform_matrix(S_train, cue_obj_train.C)

Evaluate train and validation correlation accuracy.

In [None]:
Chat_train_linear = S_train * G
JudiLing.eval_SC(Chat_train_linear, cue_obj_train.C, data_train, :Word)

In [None]:
Chat_val_linear = S_val * G
JudiLing.eval_SC(Chat_val_linear, cue_obj_val.C, cue_obj_train.C, data_val, data_train, :Word)

Run the `learn_paths` algorithm. First, compute the maximum number of time steps:

In [None]:
max_t = JudiLing.cal_max_timestep(data_train, data_val, "Word")

Now run `learn_paths` on the training data. We set the threshold to 0.01, and turn tolerance mode off.

In [None]:
prod_train_linear = JudiLing.learn_paths(
            data_train, # training dataset
            data_train, # validation dataset
            cue_obj_train.C, # form matrix for training data
            S_train, # targeted semantic matrix for validation data
            F, # comprehension model
            Chat_train_linear, # predicted form matrix for validation data
            cue_obj_train.A, # adjacency matrix for validation data
            cue_obj_train.i2f, # index-to-feature dictionary for training data
            cue_obj_train.f2i, # feature-to-index dictionary for training data
            max_t=max_t,
            threshold=0.01,
            grams=3,
            target_col="Word",
            verbose=true,
            is_tolerant = false)

Accuracy @1:

In [None]:
JudiLing.eval_acc(prod_train_linear, cue_obj_train)

Accuracy @10:

In [None]:
JudiLing.eval_acc_loose(prod_train_linear, cue_obj_train.gold_ind)
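
The difference between the strict and the loose accuracy can be illustrated with a small sketch. It uses plain strings and hypothetical toy data rather than JudiLing's result objects: the strict measure only accepts the top candidate, whereas the loose measure accepts a word if the correct form is anywhere among its candidates.

```julia
using Statistics

# Toy candidates per word, ordered from best to worst supported (hypothetical data).
candidates = [["kat", "kap"], ["hont", "hond"], ["boom"], ["vis", "fis"]]
gold = ["kat", "hond", "boom", "mus"]

# Strict: the best-supported candidate must be the target form.
strict_acc = mean(first(c) == g for (c, g) in zip(candidates, gold))

# Loose: the target form only has to be among the candidates.
loose_acc = mean(g in c for (c, g) in zip(candidates, gold))
```

Here the strict accuracy is 0.5 and the loose accuracy 0.75, because "hond" is produced, but only as the second-best candidate.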

Run `learn_paths` on validation data. We set the threshold to 0.01, but allow one trigram with a lower threshold to be included (`is_tolerant=true` and `max_tolerance=1`).

In [None]:
prod_val_linear = JudiLing.learn_paths(
            data_train, # training dataset
            data_val, # validation dataset
            cue_obj_train.C, # form matrix for training data
            S_val, # targeted semantic matrix for validation data
            F, # comprehension model
            Chat_val_linear, # predicted form matrix for validation data
            cue_obj_val.A, # adjacency matrix for validation data
            cue_obj_train.i2f, # index-to-feature dictionary for training data
            cue_obj_train.f2i, # feature-to-index dictionary for training data
            max_t=max_t,
            threshold=0.01,
            grams=3,
            target_col="Word",
            verbose=true,
            is_tolerant = true,
            max_tolerance=1)

Accuracy @1:

In [None]:
JudiLing.eval_acc(prod_val_linear, cue_obj_val)

Accuracy @10:

In [None]:
JudiLing.eval_acc_loose(prod_val_linear, cue_obj_val.gold_ind)

## Deep default model

First, we have to define a model which ends with a sigmoid function:

In [None]:
model_prod = Chain(
            Dense(size(S_train, 2) => 1000, relu),   # activation function inside layer
            Dense(1000 => size(cue_obj_train.C, 2)),
            sigmoid) |> gpu    

Now we supply this model to the `get_and_train_model` function. Note that the order of the S and C matrices is now swapped: `S_train` is the training input, `cue_obj_train.C` the training target, and analogously for the validation data. The rest is the same as for the comprehension model, with the exception of the loss function, which we specify as `Flux.binarycrossentropy`. We use the model with the lowest validation loss and stop if it has not improved for 20 epochs.

In [None]:
@time res_prod = JudiLing.get_and_train_model(S_train,
                                            cue_obj_train.C,
                                            S_val,
                                            cue_obj_val.C,
                                            data_train,
                                            data_val,
                                            :Word,
                                            "../res/dutch_model_prod.bson",
                                            verbose=true,
                                            n_epochs=100,
                                            early_stopping=20,
                                            model=model_prod,
                                            loss_func=Flux.binarycrossentropy)
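
Why a sigmoid output layer pairs naturally with a binary cross-entropy loss can be seen in a small standalone sketch. The functions `sigmoid` and `bce` below are simplified re-implementations written for this illustration, not the Flux functions themselves: the sigmoid squashes each output into the interval (0, 1), which suits the binary cue matrix, and the loss penalises outputs that are far from the 0/1 targets.

```julia
using Statistics

sigmoid(x) = 1 / (1 + exp(-x))

# Mean binary cross-entropy; `tol` guards against log(0).
function bce(yhat, y; tol=1e-7)
    return mean(@. -y * log(yhat + tol) - (1 - y) * log(1 - yhat + tol))
end

target = [1.0, 0.0, 0.0, 1.0]                  # toy binary cue vector
confident = sigmoid.([6.0, -6.0, -6.0, 6.0])   # outputs close to the targets
unsure = sigmoid.([0.0, 0.0, 0.0, 0.0])        # every output is 0.5

loss_confident = bce(confident, target)        # close to 0
loss_unsure = bce(unsure, target)              # approximately log(2)
```

A model that is undecided about every trigram thus incurs a loss of about 0.69 per cue, and the loss shrinks towards zero as the outputs approach the binary targets.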

Training correlation accuracy:

In [None]:
Chat_train = JudiLing.predict_from_deep_model(res_prod.model, S_train)
JudiLing.eval_SC(Chat_train, cue_obj_train.C, data_train, :Word)

Validation correlation accuracy:

In [None]:
Chat_val = JudiLing.predict_from_deep_model(res_prod.model, S_val)
JudiLing.eval_SC(Chat_val, cue_obj_val.C, cue_obj_train.C, data_val, data_train, :Word)

## Production algorithm

### Combining deep production with linear comprehension mapping

In [None]:
max_t = JudiLing.cal_max_timestep(data_train, data_val, :Word)

In [None]:
prod_val = JudiLing.learn_paths(
            data_train, # training dataset
            data_val, # validation dataset
            cue_obj_train.C, # form matrix for training data
            S_val, # targeted semantic matrix for validation data
            F, # comprehension model
            Chat_val, # predicted form matrix for validation data
            cue_obj_val.A, # adjacency matrix for validation data
            cue_obj_train.i2f, # index-to-feature dictionary for training data
            cue_obj_train.f2i, # feature-to-index dictionary for training data
            max_t=max_t,
            threshold=0.01,
            grams=3,
            target_col="Word",
            verbose=true,
            is_tolerant = true,
            max_tolerance=1)

In [None]:
JudiLing.eval_acc(prod_val, cue_obj_val)

In [None]:
JudiLing.eval_acc_loose(prod_val, cue_obj_val.gold_ind)

In [None]:
prod_train = JudiLing.learn_paths(
            data_train, # training dataset
            data_train, # validation dataset
            cue_obj_train.C, # form matrix for training data
            S_train, # targeted semantic matrix for validation data
            F, # comprehension model
            Chat_train, # predicted form matrix for validation data
            cue_obj_train.A, # adjacency matrix for validation data
            cue_obj_train.i2f, # index-to-feature dictionary for training data
            cue_obj_train.f2i, # feature-to-index dictionary for training data
            max_t=max_t,
            threshold=0.01,
            grams=3,
            target_col="Word",
            verbose=true,
            is_tolerant = false)

In [None]:
JudiLing.eval_acc(prod_train, cue_obj_train)

### Combining deep production with deep comprehension

In [None]:
prod_val_deep = JudiLing.learn_paths(
            data_train, # training dataset
            data_val, # validation dataset
            cue_obj_train.C, # form matrix for training data
            S_val, # targeted semantic matrix for validation data
            res_acc.model, # comprehension model
            Chat_val, # predicted form matrix for validation data
            cue_obj_val.A, # adjacency matrix for validation data
            cue_obj_train.i2f, # index-to-feature dictionary for training data
            cue_obj_train.f2i, # feature-to-index dictionary for training data
            max_t=max_t,
            threshold=0.01,
            grams=3,
            target_col="Word",
            verbose=true,
            is_tolerant = true,
            max_tolerance=1)

In [None]:
JudiLing.eval_acc(prod_val_deep, cue_obj_val)

In [None]:
prod_train_deep = JudiLing.learn_paths(
            data_train, # training dataset
            data_train, # validation dataset
            cue_obj_train.C, # form matrix for training data
            S_train, # targeted semantic matrix for validation data
            res_acc.model, # comprehension model
            Chat_train, # predicted form matrix for validation data
            cue_obj_train.A, # adjacency matrix for validation data
            cue_obj_train.i2f, # index-to-feature dictionary for training data
            cue_obj_train.f2i, # feature-to-index dictionary for training data
            max_t=max_t,
            threshold=0.01,
            grams=3,
            target_col="Word",
            verbose=true,
            is_tolerant = false)

In [None]:
JudiLing.eval_acc(prod_train_deep, cue_obj_train)

How correlated are the predicted semantic vectors of the candidates when a deep comprehension model, rather than a linear comprehension model, is used in synthesis-by-analysis?

In [None]:
using Statistics

Write the candidates produced by the learn_paths algorithm in the case of the linear model:

In [None]:
cand_dat = JudiLing.write2df(prod_val, data_val, cue_obj_train, cue_obj_val, target_col="Word")
cand_dat = cand_dat[.!ismissing.(cand_dat.pred),:]

And for the deep learning model:

In [None]:
cand_dat_deep = JudiLing.write2df(prod_val_deep, data_val, cue_obj_train, cue_obj_val, target_col="Word")
cand_dat_deep = cand_dat_deep[.!ismissing.(cand_dat_deep.pred),:]

Create cue objects for both the linear and the deep candidates:

In [None]:
cue_obj_cand_lin = JudiLing.make_cue_matrix(cand_dat, cue_obj_train, grams=3, target_col="pred");
cue_obj_cand_deep = JudiLing.make_cue_matrix(cand_dat_deep, cue_obj_train, grams=3, target_col="pred");

Get the predicted semantic matrices for both:

In [None]:
Shat_cand_lin = cue_obj_cand_lin.C * F;
Shat_cand_deep = JudiLing.predict_from_deep_model(res_acc.model, cue_obj_cand_deep.C);

Compute correlation between the candidates:

In [None]:
cor_shat_cand_lin = cor(Shat_cand_lin, dims=2);
cor_shat_cand_deep = cor(Shat_cand_deep, dims=2);

Compute the correlation of the predicted semantic vectors with that of the best-supported candidate:

In [None]:
cand_dat[!, :cor_with_best] .= 0.
for identifier in cand_dat[cand_dat.isbest .== true, :identifier]
    cor_subset = cor_shat_cand_lin[cand_dat.identifier .== identifier, cand_dat.identifier .== identifier]
    cand_dat[cand_dat.identifier .== identifier, :cor_with_best] = cor_subset[1,:]
end

In [None]:
cand_dat_deep[!, :cor_with_best] .= 0.
for identifier in cand_dat_deep[cand_dat_deep.isbest .== true, :identifier]
    cor_subset = cor_shat_cand_deep[cand_dat_deep.identifier .== identifier, cand_dat_deep.identifier .== identifier]
    cand_dat_deep[cand_dat_deep.identifier .== identifier, :cor_with_best] = cor_subset[1,:]
end
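
What the two loops above extract can be shown on a single toy word. In the sketch below (hypothetical values), the first row is taken to be the best-supported candidate, so the first row of the candidate-by-candidate correlation matrix holds the correlation of every candidate with the best one. The loops rely on the same assumption, namely that the best-supported candidate is the first row within each identifier group.

```julia
using Statistics

# Toy predicted semantic vectors for three candidates of one word (hypothetical values);
# the first row is the best-supported candidate.
Shat_cand = [1.0 2.0 3.0 4.0;
             2.0 4.0 6.0 8.5;
             4.0 3.0 2.0 1.0]

R_cand = cor(Shat_cand, dims=2)      # candidates × candidates correlation matrix
cor_with_best = R_cand[1, :]         # correlation of each candidate with the best one
```

The first entry is the correlation of the best candidate with itself, which is why the averages computed next exclude the rows where `isbest` is true.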

Compute average of the correlations between predicted semantic vectors and best-supported candidates:

In [None]:
mean(cand_dat[.!cand_dat.isbest, :cor_with_best])

In [None]:
mean(cand_dat_deep[.!cand_dat_deep.isbest, :cor_with_best])

Add this to a dataframe:

In [None]:
cand_dat[!,:comp] .= "linear"
cand_dat_deep[!,:comp] .= "deep"
cand_dat_all = vcat(cand_dat, cand_dat_deep)

Plot the distribution of correlations using a boxplot:

In [None]:
gr()

In [None]:
Pkg.add("StatsPlots")
using StatsPlots
default(fmt=:png)

In [None]:
@df cand_dat_all[.!cand_dat_all.isbest,:] boxplot(:comp, :cor_with_best, label="", ylab="Correlation",
title="Correlation of highest supported candidates with\nrespective alternative candidates")

In [None]:
savefig("../fig/deep_corr_alternatives.pdf")

## Frequency-informed Deep Discriminative Learning (FIDDL)

Load the full Dutch dataset:

In [None]:
dutch = JudiLing.load_dataset("../dat/dutch.csv")

Scale down frequencies a bit to speed up training for demonstration purposes:

In [None]:
dutch[!, "Frequency_scaled10"] = dutch.Frequency./10;
dutch[!, "Frequency_scaled10"] = Int.(ceil.(dutch.Frequency_scaled10));

Load S matrix and create cue object:

In [None]:
dutch, S = JudiLing.load_S_matrix_from_fasttext(dutch, 
                                    :nl, 
                                    target_col=:Ortho);
cue_obj = JudiLing.make_cue_matrix(dutch, grams=3, target_col="Word");

Generate a learning sequence based on the scaled frequencies.

In [None]:
learn_seq = JudiLing.make_learn_seq(dutch.Frequency_scaled10;
                                           random_seed = 314);

In [None]:
length(learn_seq)
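
The idea behind the learning sequence can be sketched with the standard library alone. The function `toy_learn_seq` is a hypothetical stand-in that assumes each word index is repeated according to its (scaled) frequency and that the resulting sequence is then shuffled with a fixed seed; JudiLing's own `make_learn_seq` may differ in its details.

```julia
using Random

freq = [25, 3, 11]                        # toy raw frequencies (hypothetical)
freq_scaled = Int.(ceil.(freq ./ 10))     # → [3, 1, 2], same scaling as above

# Assumed behaviour: repeat each word index according to its frequency,
# then shuffle the sequence reproducibly.
function toy_learn_seq(freqs; random_seed=314)
    seq = reduce(vcat, [fill(i, n) for (i, n) in enumerate(freqs)])
    return shuffle(MersenneTwister(random_seed), seq)
end

seq = toy_learn_seq(freq_scaled)
```

The length of the sequence equals the sum of the scaled frequencies, which is why dividing the frequencies by 10 shortens training roughly tenfold while rounding up ensures that every word occurs at least once.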

Train the FIDDL model.

In [None]:
res_fiddl = JudiLing.fiddl(cue_obj.C,
                S,
                learn_seq,
                dutch,
                :Word,
                "../res/dutch_fiddl_comp.bson";
                hidden_dim=1000,
                batchsize=512,
                verbose=true,
                n_batch_eval=100)

Plot the accuracies across learning steps:

In [None]:
evaluation_steps = [i*512*100 for i in 1:length(res_fiddl.accs)]
plot(evaluation_steps, res_fiddl.accs, xlab="Learning step", ylab="Accuracy", legend=false, linecolor="black")

In [None]:
savefig("../fig/fiddl_acc_bw.pdf")

Compute the final accuracy:

In [None]:
Shat = JudiLing.predict_from_deep_model(res_fiddl.model, cue_obj.C);
JudiLing.eval_SC(Shat, S, dutch, :Word)

Plot correlations with target vectors against frequency:

In [None]:
using LinearAlgebra

target_correlations = diag(cor(Shat, S, dims=2))

scatter(log.(dutch.Frequency), target_correlations, xlab="Log Frequency", ylab="Target Correlation")
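
Note that `cor(Shat, S, dims=2)` first builds the full word-by-word correlation matrix, of which only the diagonal is kept. For large datasets, a row-wise computation yields the same values without the intermediate matrix. The sketch below demonstrates the equivalence on random toy data (the variable names `Shat_toy` and `S_toy` are placeholders):

```julia
using LinearAlgebra, Statistics, Random

rng = MersenneTwister(1)
Shat_toy = randn(rng, 50, 8)    # toy predicted vectors (hypothetical data)
S_toy = randn(rng, 50, 8)       # toy target vectors

# Builds a 50 × 50 matrix and keeps only its diagonal.
full_diag = diag(cor(Shat_toy, S_toy, dims=2))

# Computes only the 50 correlations that are actually needed.
rowwise = [cor(Shat_toy[i, :], S_toy[i, :]) for i in 1:size(S_toy, 1)]
```

For a dataset with N words, the first variant needs memory for N² correlations, the second only for N.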

In [None]:
savefig("../fig/fiddl_cor_vs_freq.pdf")

There is a clear relationship between frequency and target correlation.

## Exercises

Create a random data split for the Latin dataset:

In [None]:
data_train, data_val = JudiLing.loading_data_randomly_split(
     "../dat/latin.csv", "../dat/cv_random", "latin",
     val_sample_size = 50,
     random_seed = 42);

Create the C and S matrices:

In [None]:
cue_obj_train, cue_obj_val = JudiLing.make_combined_cue_matrix(data_train,
                                   data_val,
                                   grams=3,
                                   target_col="Word");

In [None]:
S_train, S_val = JudiLing.make_combined_S_matrix(
                           data_train,
                           data_val,
                           ["Lexeme"],
                           ["Person", "Number", "Tense", "Voice", "Mood"],
                           ncol=300);

Train a DDL comprehension model with one hidden layer with a dimensionality of 500 on the training data for 100 epochs:

In [None]:
latin_res_train_only = JudiLing.get_and_train_model(cue_obj_train.C,
                                                    S_train,
                                                    "../res/latin_comp_train_only.bson", 
                                                    verbose=true)

In [None]:
Shat_train = JudiLing.predict_from_deep_model(latin_res_train_only.model, cue_obj_train.C)
Shat_val = JudiLing.predict_from_deep_model(latin_res_train_only.model, cue_obj_val.C)

In [None]:
JudiLing.eval_SC(Shat_train, S_train, data_train, :Word)

In [None]:
JudiLing.eval_SC(Shat_val, S_val, S_train, data_val, data_train, :Word)

Now we also supply the validation data to the function and train two further models: the first stops once the validation loss no longer improves, the second once the validation accuracy no longer improves:

In [None]:
latin_res_loss = JudiLing.get_and_train_model(cue_obj_train.C,
                                            S_train,
                                            cue_obj_val.C,
                                            S_val,
                                            data_train,
                                            data_val,
                                            :Word,
                                            "../res/latin_comp_loss.bson",
                                            verbose=true,
                                            n_epochs=100,
                                            early_stopping=20)

In [None]:
Shat_train = JudiLing.predict_from_deep_model(latin_res_loss.model, cue_obj_train.C)
Shat_val = JudiLing.predict_from_deep_model(latin_res_loss.model, cue_obj_val.C)

In [None]:
JudiLing.eval_SC(Shat_train, S_train, data_train, :Word)

In [None]:
JudiLing.eval_SC(Shat_val, S_val, S_train, data_val, data_train, :Word)

In [None]:
latin_res_acc = JudiLing.get_and_train_model(cue_obj_train.C,
                                            S_train,
                                            cue_obj_val.C,
                                            S_val,
                                            data_train,
                                            data_val,
                                            :Word,
                                            "../res/latin_comp_acc.bson",
                                            verbose=true,
                                            n_epochs=100,
                                            early_stopping=20,
                                            optimise_for_acc=true)

In [None]:
Shat_train = JudiLing.predict_from_deep_model(latin_res_acc.model, cue_obj_train.C)
Shat_val = JudiLing.predict_from_deep_model(latin_res_acc.model, cue_obj_val.C)

In [None]:
JudiLing.eval_SC(Shat_train, S_train, data_train, :Word)

In [None]:
JudiLing.eval_SC(Shat_val, S_val, S_train, data_val, data_train, :Word)

The accuracies are very similar across the two training runs, presumably because the dataset is small and quite regular due to the semantic vectors being simulated. Interestingly, the relationship between form and meaning seems to be so regular in this simulated example that a model overfitted to the training data (the `latin_res_train_only.model`) still generalises very well to the validation data.

Training a production model with MSE loss:

In [None]:
latin_model_prod = Chain(
            Dense(size(S_train, 2) => 500, relu),   # activation function inside layer
            Dense(500 => 500, relu),
            Dense(500 => size(cue_obj_train.C, 2))) |> gpu    

In [None]:
latin_res_prod_mse = JudiLing.get_and_train_model(S_train,
                                            cue_obj_train.C,
                                            S_val,
                                            cue_obj_val.C,
                                            data_train,
                                            data_val,
                                            :Word,
                                            "../res/latin_prod_mse.bson",
                                            verbose=true,
                                            n_epochs=100,
                                            early_stopping=20,
                                            optimise_for_acc=true,
                                            model=latin_model_prod,
                                            loss_func=Flux.mse)

In [None]:
Chat_train = JudiLing.predict_from_deep_model(latin_res_prod_mse.model, S_train)
Chat_val = JudiLing.predict_from_deep_model(latin_res_prod_mse.model, S_val)

In [None]:
JudiLing.eval_SC(Chat_train, cue_obj_train.C, data_train, :Word)

In [None]:
JudiLing.eval_SC(Chat_val, cue_obj_val.C, cue_obj_train.C,data_val, data_train, :Word)

Training a production model with binary cross-entropy loss:

In [None]:
latin_model_prod = Chain(
            Dense(size(S_train, 2) => 500, relu),   # activation function inside layer
            Dense(500 => 500, relu),
            Dense(500 => size(cue_obj_train.C, 2)),
            sigmoid) |> gpu    

In [None]:
latin_res_prod_bce = JudiLing.get_and_train_model(S_train,
                                            cue_obj_train.C,
                                            S_val,
                                            cue_obj_val.C,
                                            data_train,
                                            data_val,
                                            :Word,
                                            "../res/latin_prod_bce.bson",
                                            verbose=true,
                                            n_epochs=100,
                                            early_stopping=20,
                                            optimise_for_acc=true,
                                            model=latin_model_prod,
                                            loss_func=Flux.binarycrossentropy)

In [None]:
Chat_train = JudiLing.predict_from_deep_model(latin_res_prod_bce.model, S_train)
Chat_val = JudiLing.predict_from_deep_model(latin_res_prod_bce.model, S_val)

In [None]:
JudiLing.eval_SC(Chat_train, cue_obj_train.C, data_train, :Word)

In [None]:
JudiLing.eval_SC(Chat_val, cue_obj_val.C, cue_obj_train.C, data_val, data_train, :Word)

The choice of loss function makes a large difference: high accuracies are achieved only with the binary cross-entropy loss.
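To get a feel for how the two losses weight errors on binary form vectors, the sketch below computes both by hand for a toy target. The functions `mse_loss` and `bce_loss` are hand-written stand-ins for `Flux.mse` and `Flux.binarycrossentropy` (Flux's own implementations may differ in numerical details), and the vectors are made up for illustration. The sketch only shows how the penalties differ; it does not by itself explain the accuracy gap observed above.

```julia
using Statistics: mean

# Hand-written stand-ins for the two losses (illustration only)
mse_loss(yhat, y) = mean(abs2, yhat .- y)
bce_loss(yhat, y; epsilon = 1e-7) =
    mean(@. -y * log(yhat + epsilon) - (1 - y) * log(1 - yhat + epsilon))

# Toy binary target: one of four cues is active (hypothetical data)
y = [1.0, 0.0, 0.0, 0.0]

yhat_hit  = [0.99, 0.01, 0.01, 0.01]  # active cue predicted correctly
yhat_miss = [0.01, 0.01, 0.01, 0.01]  # active cue confidently missed

println("MSE hit:  ", mse_loss(yhat_hit, y))
println("MSE miss: ", mse_loss(yhat_miss, y))
println("BCE hit:  ", bce_loss(yhat_hit, y))
println("BCE miss: ", bce_loss(yhat_miss, y))
```

For a prediction in [0, 1], the squared error on a single cue can never exceed 1, whereas the cross-entropy penalty for a confidently missed cue grows without bound as the prediction approaches the wrong extreme.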

Next, we run `learn_paths` with the best production and comprehension models. In this notebook, the loss-based comprehension model performed best, so we use that one:

In [None]:
JudiLing.cal_max_timestep(data_train, data_val, "Word")

In [None]:
prod_val = JudiLing.learn_paths(
        data_train,            # training dataset
        data_val,              # validation dataset
        cue_obj_train.C,       # form matrix for training data
        S_val,                 # targeted semantic matrix for validation data
        latin_res_loss.model,  # comprehension mapping
        Chat_val,              # predicted form matrix for validation data
        cue_obj_val.A,         # adjacency matrix for validation data
        cue_obj_train.i2f,     # index-to-feature dictionary for training data 
        cue_obj_train.f2i,     # feature-to-index dictionary for training data
        max_t=16,
        threshold=0.001,
        grams=3,
        target_col="Word",
        verbose=true);

In [None]:
JudiLing.eval_acc(prod_val, cue_obj_val)

The accuracy is substantially higher than the 42% obtained with linear mappings.

Training a FIDDL model:

In [None]:
latin = DataFrame(CSV.File("../dat/latin.csv"))

In [None]:
cue_obj = JudiLing.make_cue_matrix(latin, grams=3, target_col="Word");
S = JudiLing.make_S_matrix(
                           latin,
                           ["Lexeme"],
                           ["Person", "Number", "Tense", "Voice", "Mood"],
                           ncol=300);

In [None]:
learn_seq = JudiLing.make_learn_seq(latin.sim_freq;
                                           random_seed = 314);

Train with a batch size of 32:

In [None]:
latin_res_fiddl32 = JudiLing.fiddl(cue_obj.C,
                S,
                learn_seq,
                latin,
                :Word,
                "../res/latin_fiddl_comp.bson";
                hidden_dim=1000,
                batchsize=32,
                verbose=true,
                n_batch_eval=10)

In [None]:
Shat = JudiLing.predict_from_deep_model(latin_res_fiddl32.model, cue_obj.C)
JudiLing.eval_SC(Shat, S, latin, :Word)

Train with a batchsize of 128:

In [None]:
latin_res_fiddl128 = JudiLing.fiddl(cue_obj.C,
                S,
                learn_seq,
                latin,
                :Word,
                "../res/latin_fiddl_comp.bson";
                hidden_dim=1000,
                batchsize=128,
                verbose=true,
                n_batch_eval=10)

In [None]:
Shat = JudiLing.predict_from_deep_model(latin_res_fiddl128.model, cue_obj.C)
JudiLing.eval_SC(Shat, S, latin, :Word)

With the smaller batch size, the accuracy is considerably higher. This is presumably because learning rate and batch size interact in their effect on the loss (see Jastrzębski et al., 2018, though note that their experiments use SGD rather than Adam). In practice, this means that if the batch size is increased, the learning rate should be increased as well. If we increase the learning rate to 0.004, the accuracy is again similar to that of the run with batchsize=32 and a learning rate of 0.001:

In [None]:
latin_res_fiddl128_hlr = JudiLing.fiddl(cue_obj.C,
                S,
                learn_seq,
                latin,
                :Word,
                "../res/latin_fiddl_comp.bson";
                hidden_dim=1000,
                batchsize=128,
                verbose=true,
                n_batch_eval=10,
                optimizer=Flux.Adam(0.004))

In [None]:
Shat = JudiLing.predict_from_deep_model(latin_res_fiddl128_hlr.model, cue_obj.C)
JudiLing.eval_SC(Shat, S, latin, :Word)
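The learning rate of 0.004 used for the run with batchsize=128 is what one obtains by keeping the ratio of learning rate to batch size constant. The helper `scale_lr` below is a hypothetical name introduced here to spell out that arithmetic; it is not part of JudiLing or Flux, and the linear rule is a heuristic rather than a guarantee, especially with Adam.

```julia
# Linear scaling heuristic: keep learning_rate / batchsize constant.
# `scale_lr` is a hypothetical helper, not a JudiLing or Flux function.
scale_lr(base_lr, base_batchsize, new_batchsize) =
    base_lr * new_batchsize / base_batchsize

base_lr = 0.001          # learning rate of the batchsize=32 run
new_lr = scale_lr(base_lr, 32, 128)

println(new_lr)          # 0.004
println(base_lr / 32 ≈ new_lr / 128)  # true: the ratio is unchanged
```

The resulting value can then be passed to the optimizer, as in `optimizer=Flux.Adam(0.004)` above.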

## References

Jastrzębski, S., Kenton, Z., Arpit, D., Ballas, N., Fischer, A., Bengio, Y., & Storkey, A. (2018). Width of minima reached by stochastic gradient descent is influenced by learning rate to batch size ratio. In Artificial Neural Networks and Machine Learning–ICANN 2018: 27th International Conference on Artificial Neural Networks, Rhodes, Greece, October 4-7, 2018, Proceedings, Part III (pp. 392-402). Springer International Publishing.