# Chapter 12.9: English past tense

In [None]:
using JudiLing
using DataFrames, Statistics, CSV

**Note**: This notebook was run with Julia 1.11. Other Julia versions may produce different random word selections, so some of the following code may not run unchanged under a different version.

# Preparation

First, we load the English dataset. We want only past tense forms in the heldout (validation) data. Because no function exists for that specific purpose, we first use the careful split function, which ensures that the validation data contain only words whose lexeme, aspect, tense, person and number, as well as all trigrams, have already occurred in the training data:

In [None]:
data_train, data_val =
JudiLing.loading_data_careful_split(
"../dat/english.csv", "english", "../dat/careful",
["Lexeme", "Continuous", "Tense", "Person", "Number"],
n_grams_target_col = "Word",
grams = 3,
val_sample_size = 300,
random_seed = 42)

Then, we subset the validation data to only contain past tense words, and put the rest back into the training data.

In [None]:
data_train = vcat(data_train, data_val[data_val.Tense .!= "past",:])
data_val = data_val[data_val.Tense .== "past",:]

Now we can inspect how many regular and irregular verbs there are in the validation data.

In [None]:
combine(groupby(data_val, :Regularity), nrow)

The last preparation step is to create semantic matrices for the training and validation data. We load them from fasttext.

In [None]:
train_small, val_small, S_train, S_val = JudiLing.load_S_matrix_from_fasttext(data_train, data_val, :en, target_col=:Word)

# Simulation 1: meaning-form mapping

In the first simulation, we use the embeddings of the heldout past tense words to predict their forms. First, we require cue matrices for both the training and the heldout data.

In [None]:
cue_obj_train, cue_obj_val = JudiLing.make_cue_matrix(train_small, val_small, 
    grams=3, target_col="Word");
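
As a rough illustration of what a trigram cue matrix encodes, the following sketch builds a small binary word-by-trigram matrix by hand. The helper `ngrams`, the toy word list, and the dense integer matrix are assumptions made for illustration only; JudiLing's own cue objects may be stored differently (for example as sparse matrices).

```julia
# Split a word into overlapping n-grams, padded with '#' as word boundary.
function ngrams(word::AbstractString; n::Int = 3)
    chars = collect("#" * word * "#")
    return [String(chars[i:(i + n - 1)]) for i in 1:(length(chars) - n + 1)]
end

words = ["walk", "walked", "talk"]               # toy data (hypothetical)
cues = sort(unique(vcat(ngrams.(words)...)))      # inventory of all trigrams
f2i = Dict(c => i for (i, c) in enumerate(cues))  # trigram => column index

# C[r, j] == 1 if word r contains trigram j, and 0 otherwise.
C = zeros(Int, length(words), length(cues))
for (r, w) in enumerate(words), g in ngrams(w)
    C[r, f2i[g]] = 1
end
```

Each row of `C` then marks which trigrams occur in the corresponding word, and `f2i` plays a role comparable to the `f2i` field used later in this notebook.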

Next, we train an F matrix on the training data, and calculate the Shat matrix to evaluate it.

In [None]:
F = JudiLing.make_transform_matrix(cue_obj_train.C, S_train)
Shat = cue_obj_train.C * F;
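
One standard way to estimate a linear mapping from a cue matrix to a semantic matrix is least squares. The sketch below uses the pseudoinverse on hypothetical toy matrices; it is meant to show the idea only, and the numerical method used internally by `JudiLing.make_transform_matrix` may differ.

```julia
using LinearAlgebra

# Toy data (hypothetical): 4 words, 3 cues, 2 semantic dimensions.
C = [1.0 0.0 1.0;
     0.0 1.0 1.0;
     1.0 1.0 0.0;
     1.0 0.0 0.0]
S = [ 0.5 -1.0;
      1.0  0.2;
     -0.3  0.7;
      0.9  0.4]

F = pinv(C) * S     # least-squares solution of C * F ≈ S
Shat = C * F        # predicted semantic vectors

# At the least-squares optimum the residual is orthogonal to the columns of C.
residual = S - Shat
```

With more words than cues, `Shat` generally only approximates `S`, which is why the predicted matrix is evaluated against the gold matrix in the next step.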

In [None]:
JudiLing.eval_SC(Shat, S_train, train_small, :Word)
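
A correlation-based evaluation of this kind can be sketched as follows: a predicted row vector counts as correct if, among all gold row vectors, it correlates most strongly with its own target. The function name `correlation_accuracy` and the toy matrices are hypothetical, and details such as the handling of homophones in `JudiLing.eval_SC` are not covered here.

```julia
using Statistics

function correlation_accuracy(Shat::AbstractMatrix, S::AbstractMatrix)
    # R[i, j] is the Pearson correlation between row i of Shat and row j of S.
    R = cor(permutedims(Shat), permutedims(S))
    hits = [argmax(R[i, :]) == i for i in 1:size(R, 1)]
    return mean(hits)
end

# Toy gold matrix and a slightly perturbed prediction (hypothetical).
S = [1.0 2.0 3.0; 3.0 1.0 2.0; 2.0 3.0 1.5]
Shat = S .+ [0.1 0.0 0.0; 0.0 0.1 0.0; 0.0 0.0 0.1]

acc = correlation_accuracy(Shat, S)
```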

Likewise, we train the production matrix G and calculate the Chat matrix to evaluate it.

In [None]:
G = JudiLing.make_transform_matrix(S_train, cue_obj_train.C);
Chat_train = S_train * G;

In [None]:
JudiLing.eval_SC(Chat_train, cue_obj_train.C, train_small, :Word)

Run learn paths on the training data.

In [None]:
res_learn_train = JudiLing.learn_paths(
train_small,
cue_obj_train,
S_train,
F,
Chat_train,
threshold = 0.01,
verbose = true,
);

Accuracy on the training data.

In [None]:
JudiLing.eval_acc(res_learn_train, cue_obj_train)

Accuracy @10.

In [None]:
JudiLing.eval_acc_loose(res_learn_train, cue_obj_train.gold_ind)
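
The difference between strict accuracy and accuracy @10 can be sketched with plain vectors: under the loose criterion a word counts as correct if its target form appears anywhere among the top `k` candidates. The function `loose_accuracy` and the candidate lists below are hypothetical and do not reproduce the result objects returned by `learn_paths`.

```julia
# Share of words whose gold form is among the first k ranked candidates.
function loose_accuracy(candidates::AbstractVector, gold::AbstractVector; k::Int = 10)
    hits = [g in first(c, k) for (c, g) in zip(candidates, gold)]
    return sum(hits) / length(gold)
end

# Toy ranked candidate lists, best candidate first (hypothetical).
candidates = [["walked", "walks"], ["goed", "went"], ["sang"]]
gold = ["walked", "went", "sung"]

strict = loose_accuracy(candidates, gold; k = 1)   # only the best candidate counts
loose = loose_accuracy(candidates, gold; k = 10)   # any candidate counts
```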

Create dataframe with all productions, and join it with the original training dataframe.

In [None]:
df_train = JudiLing.write2df(res_learn_train, train_small, cue_obj_train, cue_obj_train, target_col=:Word)
df_train = leftjoin(df_train, train_small, on = :identifier => :Word)
df_train

Subset to only contain rows with past tense targets, and only the first candidates.

In [None]:
past_tense_cands = df_train[(df_train.isbest .== true) .& (df_train.Tense .== "past"),:]

Accuracy for regulars and irregulars.

In [None]:
combine(groupby(past_tense_cands, :Regularity), :iscorrect => mean)
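
The grouped accuracy computed above is simply the mean of `iscorrect` within each regularity class. A minimal sketch with plain vectors, using made-up values rather than the notebook's results:

```julia
using Statistics

# Toy data (hypothetical): one entry per best candidate.
regularity = ["regular", "regular", "irregular", "irregular", "regular"]
iscorrect = [true, true, false, true, false]

# Mean of the Boolean vector within each group = proportion correct.
acc = Dict(g => mean(iscorrect[regularity .== g]) for g in unique(regularity))
```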

Errors

In [None]:
past_tense_cands[past_tense_cands.iscorrect .== false,:]

In [None]:
CSV.write("../res/past_tense_errors.csv", past_tense_cands[past_tense_cands.iscorrect .== false,:])

- no change: 5
- semantic: 39
- tense: 2
- overregularisation: 8
- overirregularisation: 4
- other: 7

Next, we produce the heldout forms.
First, we compute the Chat matrix for the validation data.

In [None]:
Chat_val = S_val * G;

Then run learn paths algorithm.

In [None]:
res_learn_val = JudiLing.learn_paths(
train_small,
val_small,
cue_obj_train.C,
S_val,
F,
Chat_val,
cue_obj_train.A,
cue_obj_train.i2f,
cue_obj_train.f2i, # api changed in 0.3.1
max_t = JudiLing.cal_max_timestep(val_small, :Word),
max_can = 10,
grams = 3,
threshold = 0.01,
is_tolerant=true,
max_tolerance=1,
tolerance=-1.,
target_col = :Word,
verbose = true,
);

Accuracy

In [None]:
JudiLing.eval_acc(res_learn_val, cue_obj_val)

Accuracy @10

In [None]:
JudiLing.eval_acc_loose(res_learn_val, cue_obj_val.gold_ind)

# Simulation 2: form-meaning-form mapping

For the second simulation, we first map the base forms of the heldout past tense words onto predicted semantic vectors, then add a past tense vector, and finally use the resulting semantic vectors to predict the past tense forms.

Load the base forms of the past tense forms with their phonology.

In [None]:
base = JudiLing.load_dataset("../dat/english_heldout_base_orth2.csv")

Now we require three cue matrices: one for the training data, one for the base forms, and one for the heldout forms (for verifying the produced forms).

In [None]:
cue_obj_train, cue_obj_base = JudiLing.make_combined_cue_matrix(train_small[:, ["Word"]], base[:, ["Word"]], 
 grams=3, target_col="Word")

cue_obj_val = JudiLing.make_cue_matrix(val_small[:, ["Word"]], cue_obj_train,
 grams=3, target_col="Word")

Train the F matrix and predict semantic vectors for the training and the base forms.

In [None]:
F = JudiLing.make_transform_matrix(cue_obj_train.C, S_train)
Shat_train = cue_obj_train.C * F

In [None]:
JudiLing.eval_SC(Shat_train, S_train, train_small, :Word)

Now we create semantic vectors for the heldout forms.
First, we require predicted vectors for the base forms:

In [None]:
Shat_base = cue_obj_base.C * F

Next, we impute vectors for all features in the `:features` column in our data:

In [None]:
L = JudiLing.make_pS_matrix(train_small, features_col = :features);

In [None]:
W = JudiLing.make_transform_matrix(L.pS, S_train);
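
The idea behind imputing feature vectors can be sketched on toy data: a binary word-by-feature matrix is mapped onto the semantic matrix, and each row of the resulting mapping serves as the vector of one feature. The feature inventory, the additively constructed "embeddings" and the use of `pinv` are assumptions for illustration; the actual content of the `:features` column and the solver used by JudiLing may differ.

```julia
using LinearAlgebra

features = ["walk", "talk", "past"]            # toy feature inventory (hypothetical)
f2i = Dict(f => i for (i, f) in enumerate(features))

# One row per word form: walk, walked, talk, talked.
pS = [1.0 0.0 0.0;
      1.0 0.0 1.0;
      0.0 1.0 0.0;
      0.0 1.0 1.0]

# Toy "embeddings", constructed additively so that the answer is known.
walk, talk, past = [1.0, 0.0, 2.0], [0.0, 1.0, -1.0], [-0.5, 1.0, 1.0]
S = permutedims(hcat(walk, walk + past, talk, talk + past))

W = pinv(pS) * S                 # one semantic vector per feature
past_vec = W[f2i["past"], :]     # row of W belonging to the feature "past"
```

Because the toy vectors are exactly additive, the recovered `past_vec` equals the vector that was added; with real embeddings the imputed vector is only a least-squares estimate.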

This way, we get a past tense vector, which we now add to the base vectors.

In [None]:
past_vec = W[L.f2i["past"],:]

S_base_past = Shat_base .+ past_vec'
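
The expression `Shat_base .+ past_vec'` relies on broadcasting: the transposed vector acts as a single row that is added to every row of the matrix. A small sketch with made-up numbers:

```julia
# Toy predicted base-form vectors: 3 words, 2 semantic dimensions (hypothetical).
Shat_base = [1.0 2.0;
             3.0 4.0;
             5.0 6.0]
past_vec = [0.5, -1.0]

# past_vec' is a 1×2 row; broadcasting adds it to each of the 3 rows.
S_base_past = Shat_base .+ past_vec'
```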

Now, we first train the G matrix, and then predict Chat matrices for the training data, as well as for the heldout data based on the semantic vectors we just created.

In [None]:
G = JudiLing.make_transform_matrix(S_train, cue_obj_train.C);
Chat = S_train * G;
Chat_val_base_past = S_base_past * G;

Now we run the learn paths algorithm on the heldout data. Note that we now pass the predicted C matrix based on the created semantic vectors (`Chat_val_base_past`) as well as those semantic vectors (`S_base_past`).

In [None]:
res_learn_base_past = JudiLing.learn_paths(
train_small,
val_small,
cue_obj_train.C,
S_base_past,
F,
Chat_val_base_past,
cue_obj_train.A,
cue_obj_train.i2f,
cue_obj_train.f2i, # api changed in 0.3.1
max_t = JudiLing.cal_max_timestep(val_small, :Word),
max_can = 10,
grams = 3,
threshold = 0.01,
is_tolerant=true,
max_tolerance=1,
tolerance=-1.,
target_col = :Word,
verbose = true,
);

Accuracy:

In [None]:
JudiLing.eval_acc(res_learn_base_past, cue_obj_val)

Accuracy @10

In [None]:
JudiLing.eval_acc_loose(res_learn_base_past, cue_obj_val.gold_ind)

Write to dataframe and join with validation dataframe.

In [None]:
df_base_past = JudiLing.write2df(res_learn_base_past, val_small, cue_obj_train, cue_obj_val, target_col=:Word)
df_base_past = leftjoin(df_base_past, val_small, on = :identifier => :Word)
df_base_past[df_base_past.isbest .== true,:]

In [None]:
best = df_base_past[df_base_past.isbest .== true,:]
best[:, ["identifier", "pred", "Regularity"]]

Compute accuracy for regular and irregular verbs.

In [None]:
combine(groupby(df_base_past[df_base_past.isbest .== true,:], :Regularity), :iscorrect => mean)

# Exercises
## Exercise 1
Rerun the second analysis (using past tense vectors created on the fly) using phonological representations. Note that you will have to create a new careful split. Use `random_seed = 42`. A dataframe with base forms can be found in `dat/english_heldout_base.csv`. How do the results change?

Split the data using random seed 42 and this time using `"Phon"` as the target column.

In [None]:
data_train_phon, data_val_phon =
JudiLing.loading_data_careful_split(
"../dat/english.csv", "english_phon", "../dat/careful",
["Lexeme", "Continuous", "Tense", "Person", "Number"],
n_grams_target_col = "Phon",
grams = 3,
val_sample_size = 300,
random_seed = 42)

Keep all past tense forms in the validation data and merge the rest back into the training data.

In [None]:
data_train_phon = vcat(data_train_phon, data_val_phon[data_val_phon.Tense .!= "past",:])
data_val_phon = data_val_phon[data_val_phon.Tense .== "past",:]

In [None]:
combine(groupby(data_val_phon, :Regularity), nrow)

Load the dataframe with base forms.

In [None]:
base_phon = DataFrame(CSV.File("../dat/english_heldout_base_phon2.csv"))

Load semantic vectors for the words in the training data.

In [None]:
train_small_phon, S_train_phon = JudiLing.load_S_matrix_from_fasttext(data_train_phon, :en, target_col=:Word)

Create cue matrices.

In [None]:
cue_obj_train_phon, cue_obj_base_phon = JudiLing.make_combined_cue_matrix(train_small_phon[:, ["Phon"]], base_phon[:, ["Phon"]], 
 grams=3, target_col="Phon")

cue_obj_val_phon = JudiLing.make_cue_matrix(data_val_phon[:, ["Phon"]], cue_obj_train_phon,
 grams=3, target_col="Phon")

Train F matrices, predict semantic matrix and evaluate.

In [None]:
F_phon = JudiLing.make_transform_matrix(cue_obj_train_phon.C, S_train_phon)
Shat_train_phon = cue_obj_train_phon.C * F_phon

In [None]:
JudiLing.eval_SC(Shat_train_phon, S_train_phon, train_small_phon, :Phon)

Predict semantic vectors for the base forms.

In [None]:
Shat_base_phon = cue_obj_base_phon.C * F_phon

Create past tense form semantic vectors.

In [None]:
L_phon = JudiLing.make_pS_matrix(train_small_phon, features_col = :features);
W_phon = JudiLing.make_transform_matrix(L_phon.pS, S_train_phon);
past_vec_phon = W_phon[L_phon.f2i["past"],:]

In [None]:
S_base_past_phon = Shat_base_phon .+ past_vec_phon'

Train production matrix and predict.

In [None]:
G_phon = JudiLing.make_transform_matrix(S_train_phon, cue_obj_train_phon.C);
Chat_phon = S_train_phon * G_phon;
Chat_val_base_past_phon = S_base_past_phon * G_phon;

Run learn paths.

In [None]:
res_learn_base_past_phon = JudiLing.learn_paths(
train_small_phon,
data_val_phon,
cue_obj_train_phon.C,
S_base_past_phon,
F_phon,
Chat_val_base_past_phon,
cue_obj_train_phon.A,
cue_obj_train_phon.i2f,
cue_obj_train_phon.f2i, # api changed in 0.3.1
max_t = JudiLing.cal_max_timestep(data_val_phon, :Phon),
max_can = 10,
grams = 3,
threshold = 0.01,
is_tolerant=true,
max_tolerance=1,
tolerance=-1.,
target_col = :Phon,
verbose = true,
);

Accuracy.

In [None]:
JudiLing.eval_acc(res_learn_base_past_phon, cue_obj_val_phon)

Write to dataframe, join with full validation dataframe and display best supported candidates.

In [None]:
df_base_past_phon = JudiLing.write2df(res_learn_base_past_phon, data_val_phon, cue_obj_train_phon, cue_obj_val_phon, target_col=:Phon)
df_base_past_phon = leftjoin(df_base_past_phon, data_val_phon, on = :identifier => :Phon)
df_base_past_phon[df_base_past_phon.isbest .== true,:]

In [None]:
last(df_base_past_phon[df_base_past_phon.isbest .== true,:],10)

Accuracy for regulars and irregulars.

In [None]:
combine(groupby(df_base_past_phon[df_base_past_phon.isbest .== true,:], :Regularity), :iscorrect => mean)

Conclusions:

With phonological representations, the model with past tense vectors created on the fly does not perform well (you can try running the very first analysis with past tense vectors from the embedding space as input; the results are similar). One possible reason for this drop in accuracy relative to the orthographic representations is that regular past tense forms are much more uniform in orthography (always ending in "ed") than in phonology (sometimes ending in `d`, sometimes in `t`). This matters particularly for the trigram representation, where `ed#` forms a single trigram, which is usually not the case for the phonological representation. Try rerunning this analysis using biphones instead of triphones and you will see that the result improves quite dramatically. Just make sure you still split the data based on triphones; otherwise `english_heldout_base.csv` won't match the held-out data.
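
The contrast between the word-final trigrams in the two representations can be made concrete with a small sketch. The helper `final_trigram` is hypothetical, and the "phonological" strings are ASCII stand-ins invented for illustration; they are not taken from the `Phon` column of the dataset.

```julia
# Final trigram of a word, including the '#' word-boundary marker.
final_trigram(word::AbstractString) = last(word * "#", 3)

orth = ["walked", "hugged", "wanted"]
# Hypothetical ASCII stand-ins for phonological transcriptions.
phon = ["wOkt", "hVgd", "wQntId"]

orth_finals = unique(final_trigram.(orth))   # a single shared final trigram
phon_finals = unique(final_trigram.(phon))   # a different final trigram per word
```

In the orthographic toy forms all three regular past tenses share one final cue, whereas each phonological stand-in ends in a different trigram.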

## Exercise 2
Rerun the first analysis (i.e. model using trigrams rather than triphones), but instead of making use of real past tense forms as the heldout data, use the following list of nonwords (from Albright & Hayes, 2003): bize, dize, flidge, fro, gare, glip, rife, stin, stip, blafe, bredge, chool, dape, gezz, nace, spack, stire, tesh, wiss, blig, chake, drit, fleep, gleed, glit, plim, queed, scride, spling, gude, nold, nung, pank, preak, rask, shilk, tark, teep, trisk, tunk.

Is the model able to predict plausible past tense forms of the nonwords? What are problems the model has with these forms that it didn't have in the previous analyses?

First, we create a dataframe with all the nonwords in a column called `:Word` to match the column name in the training data.

In [None]:
nonwords = "bize, dize, flidge, fro, gare, glip, rife, stin, stip, blafe, bredge, chool, dape, gezz, nace, spack, stire, tesh, wiss, blig, chake, drit, fleep, gleed, glit, plim, queed, scride, spling, gude, nold, nung, pank, preak, rask, shilk, tark, teep, trisk, tunk"
nonwords = split(nonwords, ", ")
nonwords_df = DataFrame(:Word => nonwords)

Now we create cue matrices for the training data and the nonwords. We can't create one for the heldout past tense forms because there are no "correct" past tense forms for these nonwords.

In [None]:
cue_obj_train_orth_nw, cue_obj_base_orth_nw = JudiLing.make_combined_cue_matrix(train_small[:, ["Word"]], nonwords_df[:, ["Word"]], 
 grams=3, target_col="Word")

Train F matrix.

In [None]:
F_orth_nw = JudiLing.make_transform_matrix(cue_obj_train_orth_nw.C, S_train)

Predict semantic vectors for the nonword base forms and add past tense vector.

In [None]:
Shat_base_orth_nw = cue_obj_base_orth_nw.C * F_orth_nw

In [None]:
S_base_past_orth_nw = Shat_base_orth_nw .+ past_vec'

Train G matrix and predict form vectors for nonword past tense forms.

In [None]:
G_orth_nw = JudiLing.make_transform_matrix(S_train, cue_obj_train_orth_nw.C);
Chat_val_base_past_orth_nw = S_base_past_orth_nw * G_orth_nw;

Run learn paths for the nonwords.

In [None]:
res_learn_base_past_orth_nw = JudiLing.learn_paths(
train_small,
nonwords_df,
cue_obj_train_orth_nw.C,
S_base_past_orth_nw,
F_orth_nw,
Chat_val_base_past_orth_nw,
cue_obj_train_orth_nw.A,
cue_obj_train_orth_nw.i2f,
cue_obj_train_orth_nw.f2i, # api changed in 0.3.1
max_t = JudiLing.cal_max_timestep(nonwords_df, :Word),
max_can = 10,
grams = 3,
threshold = 0.01,
is_tolerant=true,
max_tolerance=1,
tolerance=-1.,
target_col = :Word,
verbose = true,
);

Since there are no target forms for these nonwords, we can't compute any accuracy. We can therefore only inspect the results.

In [None]:
df_base_past_orth_nw = JudiLing.write2df(res_learn_base_past_orth_nw, nonwords_df, cue_obj_train_orth_nw, cue_obj_base_orth_nw, target_col=:Word)
df_base_past_orth_nw = leftjoin(df_base_past_orth_nw, nonwords_df, on = :identifier => :Word)
first(df_base_past_orth_nw[df_base_past_orth_nw.isbest .== true,:], 20)

In [None]:
last(df_base_past_orth_nw[df_base_past_orth_nw.isbest .== true,:], 20)

Conclusions:

Overall, performance on these nonwords is somewhat worse than on the real words. Part of the reason is surely that some of the trigrams in the nonwords do not occur in the training data, which makes the production model's task much harder.
The most frequent "error" is again the "no change" error. In a few instances this might be a plausible past form, such as for "queed" or "gleed". There are no plausible irregular-like stem changes. There are, however, a few completely implausible forms, such as "dinged" for "gude" or "crided" for "scride". These are presumably due to the missing trigrams mentioned above. A positive aspect of the produced forms is that they capture regularities such as the doubling of "t" in "drit" => "dritted" or "glit" => "glitted".

Overall, the model seems to be able to deal with nonwords reasonably well, but it is also evident that performance suffers if the words are phonologically or orthographically implausible (which is the case here when trigrams are missing). One way to improve this may be to use bigrams instead of trigrams.
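
Why bigrams might help can be sketched by checking which n-grams of a nonword never occur in the training words. The helpers `ngrams` and `missing_ngrams` and the three-word training list are hypothetical; the actual coverage depends on the real training data.

```julia
# Split a word into overlapping n-grams, padded with '#' as word boundary.
function ngrams(word::AbstractString; n::Int = 3)
    chars = collect("#" * word * "#")
    return [String(chars[i:(i + n - 1)]) for i in 1:(length(chars) - n + 1)]
end

# n-grams of `word` that never occur in any of the training words.
function missing_ngrams(word, train_words; n::Int = 3)
    known = Set(g for w in train_words for g in ngrams(w; n = n))
    return [g for g in ngrams(word; n = n) if !(g in known)]
end

train_words = ["guide", "dude", "rude"]   # toy training data (hypothetical)

missing_tri = missing_ngrams("gude", train_words; n = 3)
missing_bi = missing_ngrams("gude", train_words; n = 2)
```

In this toy example the nonword contains one unseen trigram but no unseen bigram, so a bigram-based model would have all the cues it needs to represent the form.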