# Chapter 13.2: Mandarin Chinese

In [None]:
using JudiLing, DataFrames

Loading the data:

In [None]:
mandarin = JudiLing.load_dataset("../dat/mandarin.csv")

## Using the phonological representation

Creating the cue object:

In [None]:
cue_obj = JudiLing.make_cue_matrix(mandarin, 
                                    grams=3, 
                                    tokenized=true, 
                                    sep_token=".", 
                                    keep_sep=true,
                                    target_col = :phones);
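
To make these settings concrete, here is a minimal sketch of how a tokenized form can be cut into trigram cues. It illustrates the idea only and is not JudiLing's own implementation: the helper `token_ngrams` and the example form are hypothetical, and the `#` boundary marker is assumed to match JudiLing's default `start_end_token`.

```julia
# Hypothetical tokenized form; the actual entries in mandarin.phones may look different.
form = "n.i3.h.ao3"

# Illustrative helper (not part of JudiLing): split on the separator,
# pad with boundary markers, and join each window of `grams` tokens.
function token_ngrams(form; grams=3, sep_token=".", boundary="#")
    tokens = vcat([boundary], String.(split(form, sep_token)), [boundary])
    return [join(tokens[i:i+grams-1], sep_token) for i in 1:(length(tokens) - grams + 1)]
end

cues = token_ngrams(form)
# 4 cues: "#.n.i3", "n.i3.h", "i3.h.ao3", "h.ao3.#"
```

Because the form is tokenized, a token such as `ao3` (including its tone digit) counts as a single unit inside each trigram.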

In [None]:
JudiLing.display_matrix(mandarin, :phones, cue_obj, cue_obj.C, :C)

Loading the S matrix:

In [None]:
S, words = JudiLing.load_S_matrix("../dat/S_mandarin.txt")

In [None]:
JudiLing.display_matrix(mandarin, :phones, cue_obj, S, :S)

Comprehension:

In [None]:
F = JudiLing.make_transform_matrix(cue_obj.C, S);
Shat = cue_obj.C * F;
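
The mapping `F` can be thought of as the solution to a least-squares problem: it is chosen so that `C * F` is as close as possible to `S`. The toy example below sketches this with `pinv` from Julia's `LinearAlgebra` standard library. The matrices are made up, and `pinv` merely stands in for whatever solver `make_transform_matrix` uses internally.

```julia
using LinearAlgebra

# Toy cue matrix: 3 words (rows) by 3 cues (columns).
C_toy = [1.0 0.0 1.0;
         0.0 1.0 1.0;
         1.0 1.0 0.0]
# Toy semantic matrix: 3 words (rows) by 2 semantic dimensions.
S_toy = [ 0.5 -0.2;
          0.1  0.9;
         -0.3  0.4]

# Least-squares mapping from cues to meanings, and the predicted meanings.
F_toy = pinv(C_toy) * S_toy
Shat_toy = C_toy * F_toy
```

Here `C_toy` is invertible, so `Shat_toy` reproduces `S_toy` up to rounding error. With real data, where there are usually more cues than can be fit exactly, `Shat` is only an approximation of `S`.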

In [None]:
JudiLing.eval_SC(Shat, S, mandarin, :phones)

Production:

In [None]:
G = JudiLing.make_transform_matrix(S, cue_obj.C);
Chat = S * G;

In [None]:
res = JudiLing.learn_paths(mandarin, 
                            cue_obj, 
                            S, 
                            F, 
                            Chat, 
                            threshold=0.01,
                            Shat_val=Shat);

In [None]:
JudiLing.eval_acc(res, cue_obj)

## Using the character representation

Creating the cue object:

In [None]:
cue_obj_char = JudiLing.make_cue_matrix(mandarin, grams=2, target_col = :word)
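
Without tokenization, the cues are overlapping sequences of single characters. As a rough sketch (not JudiLing's internals, and using a hypothetical example word with the `#` boundary marker assumed above), character bigrams can be formed like this:

```julia
# Hypothetical two-character word, padded with boundary markers.
word = "你好"
chars = collect("#" * word * "#")

# Each bigram is a window of two adjacent characters.
bigrams = [String(chars[i:i+1]) for i in 1:(length(chars) - 1)]
# 3 bigrams: "#你", "你好", "好#"
```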

In [None]:
JudiLing.display_matrix(mandarin, :word, cue_obj_char, cue_obj_char.C, :C)

Comprehension:

In [None]:
F_char = JudiLing.make_transform_matrix(cue_obj_char.C, S);
Shat_char = cue_obj_char.C * F_char;

In [None]:
JudiLing.eval_SC(Shat_char, S, mandarin, :word)

Production:

In [None]:
G_char = JudiLing.make_transform_matrix(S, cue_obj_char.C);
Chat_char = S * G_char;

In [None]:
res_char = JudiLing.learn_paths(mandarin, 
                                cue_obj_char, 
                                S, 
                                F_char, 
                                Chat_char, 
                                threshold=0.01,
                                Shat_val=Shat_char);

In [None]:
JudiLing.eval_acc(res_char, cue_obj_char)

Analysing the productions:

In [None]:
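# Collect the production candidates in a data frame
df = JudiLing.write2df(res_char, mandarin, cue_obj_char, cue_obj_char, target_col=:word);
# Keep the best candidate for each word, plus rows where isbest is missing
df_isbest = df[ismissing.(df.isbest) .| (df.isbest .== 1),:]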

In [None]:
# Show the best candidates that are incorrect or have no correctness value
df_isbest[ismissing.(df_isbest.iscorrect) .| (df_isbest.iscorrect .== 0),:]

## Exercises

Comprehension without tone markers:

In [None]:
# Creating a new column without tone markers
mandarin[!,"phones_no_tones"] = [replace(phone, r"[1-9]" => "") for phone in mandarin.phones]
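
As a small illustration of what this replacement does, the sketch below applies the same regular expression to a few made-up forms. The forms are hypothetical, and tone is assumed to be marked by a digit, as the pattern implies.

```julia
# Hypothetical forms; the first two differ only in tone.
forms = ["m.a1", "m.a3", "sh.u1"]

# Same replacement as above: delete every digit from 1 to 9.
no_tones = [replace(f, r"[1-9]" => "") for f in forms]

n_before = length(unique(forms))     # distinct forms with tones
n_after = length(unique(no_tones))   # distinct forms without tones
```

In this toy example the two forms that differ only in tone collapse into one, so the number of distinct forms drops from 3 to 2.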

In [None]:
cue_obj_no_tones = JudiLing.make_cue_matrix(mandarin, 
                                    grams=3, 
                                    tokenized=true, 
                                    sep_token=".", 
                                    keep_sep=true,
                                    target_col = :phones_no_tones);

In [None]:
F_no_tones = JudiLing.make_transform_matrix(cue_obj_no_tones.C, S);
Shat_no_tones = cue_obj_no_tones.C * F_no_tones
JudiLing.eval_SC(Shat_no_tones, S, mandarin, :word)

Compared to the 98.8% accuracy obtained with tones, this is a substantial drop in accuracy.

The same analysis, but without single-syllable (that is, single-character) words:

In [None]:
mandarin[!, "char_num"] = length.(mandarin.word);
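
Note that `length` counts characters rather than bytes, which is what makes it usable as a character count for Chinese words. A quick check with a hypothetical example word:

```julia
word = "你好"
n_chars = length(word)   # number of characters
n_bytes = sizeof(word)   # number of bytes in the UTF-8 encoding
```

Here `n_chars` is 2, while `n_bytes` is 6 because each of these characters takes three bytes in UTF-8.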

In [None]:
mandarin_multi = mandarin[mandarin.char_num .> 1,:]

In [None]:
cue_obj_no_tones_multi = JudiLing.make_cue_matrix(mandarin_multi, 
                                    grams=3, 
                                    tokenized=true, 
                                    sep_token=".", 
                                    keep_sep=true,
                                    target_col = :phones_no_tones);

In [None]:
S_multi = S[mandarin.char_num .> 1,:];

In [None]:
F_no_tones_multi = JudiLing.make_transform_matrix(cue_obj_no_tones_multi.C, S_multi);
Shat_no_tones_multi = cue_obj_no_tones_multi.C * F_no_tones_multi
JudiLing.eval_SC(Shat_no_tones_multi, S_multi, mandarin_multi, :word)

Without single-syllable words, accuracy is clearly higher again. This suggests that tone is particularly important for distinguishing single-character words.

Comparing the accuracy of the two mappings, one based on phones and one based on characters, under strict evaluation:

In [None]:
JudiLing.eval_SC(Shat,S)

In [None]:
JudiLing.eval_SC(Shat_char,S)

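Under strict evaluation, the character-based mapping is more accurate than the phone-based one. Strict evaluation counts a prediction as wrong when its best match is a different word with the same form, so the representation with more identical forms is penalised more. Since the two mappings perform similarly under lenient evaluation, this implies that the dataset contains more homophones than homographs.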
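
The difference between the two evaluation modes can be sketched with made-up data. The code below is an illustration under stated assumptions, not the implementation of `eval_SC`: it assumes that a predicted vector is matched to the target vector it correlates with most strongly, that strict evaluation requires the matched row to be the word itself, and that lenient evaluation accepts any row with the same form.

```julia
using Statistics

# Made-up data: words 1 and 2 share a form, i.e. they are homophones.
forms = ["ma1", "ma1", "shu1"]
S_toy = [1.0 0.0 0.0 0.5;
         0.8 0.2 0.0 0.5;
         0.0 1.0 0.5 0.0]
# Predicted vectors: the prediction for word 1 lands on word 2's vector.
Shat_toy = [0.8 0.2 0.0 0.5;
            0.8 0.2 0.0 0.5;
            0.0 1.0 0.5 0.0]

n = size(S_toy, 1)
# R[i, j] is the correlation between predicted vector i and target vector j.
R = [cor(Shat_toy[i, :], S_toy[j, :]) for i in 1:n, j in 1:n]
predicted = [argmax(R[i, :]) for i in 1:n]

strict = mean(predicted .== 1:n)             # matched row must be the word itself
lenient = mean(forms[predicted] .== forms)   # matched row may be any word with the same form
```

In this toy case word 1 is matched to its homophone, word 2, so strict accuracy is 2/3 while lenient accuracy is 1.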

In [None]:
length(unique(mandarin.word))

In [None]:
length(unique(mandarin.phones))

There are indeed fewer unique phone representations than unique character representations in the dataset.