# Baseline models for reconstruction: Random guessing and Zero Rule

To establish baselines for reconstructing Middle Chinese pronunciations, we first experiment with random guessing and then move on to the Zero Rule (0R).

In [1]:
import numpy as np
import pandas as pd
np.random.seed(42)

Load data:

In [2]:
matrix = pd.read_csv('model/data-matrix-karlgren.csv').set_index('character')

In [3]:
matrix.head()

character,mando_onset_b,mando_onset_c,mando_onset_ch,mando_onset_d,mando_onset_f,mando_onset_g,mando_onset_h,mando_onset_j,mando_onset_k,mando_onset_l,...,Karlgren_nucleus_ə̯u,Karlgren_nucleus_ɨ̯ɐ,Karlgren_coda_k̚,Karlgren_coda_m,Karlgren_coda_n,Karlgren_coda_p̚,Karlgren_coda_t̚,Karlgren_coda_ŋ,Karlgren_coda_̯,Karlgren_coda_∅
㐁,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
㐆,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
㐭,0,0,0,0,0,0,0,0,0,1,...,0,0,0,1,0,0,0,0,0,0
㐱,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
㐲,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


Let's take a look at the entry for 算 ('compute'):

In [4]:
entry = matrix.loc['算']
indices = list(entry[entry == 1].index)
print('Query: 算')
print('- In Mandarin:', ' + '.join([x[x.rfind('_') + 1:] for x in indices if 'mand' in x][:-1]))
print('- In Korean:', ' + '.join([x[x.rfind('_') + 1:] for x in indices if 'kor' in x]))
print('- Karlgren\'s reconstruction:', ' + '.join([x[x.rfind('_') + 1:] for x in indices if 'Karlgren' in x]))

Query: 算
- In Mandarin: s + ua + n
- In Korean: ㅅ + ㅏ + ㄴ
- Karlgren's reconstruction: s + uɑ + n


## Random guessing

For each character, we want to predict the reconstructed tone, onset, nucleus, and coda. For each of these syllabic elements, we pick one label uniformly at random from all possible labels.

Here are some of the onsets from Karlgren's scheme: **bʱ dʱ d͡zʱ ... ʔ ∅**

For each character, we simply guess an onset at random from this inventory.
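The guessing step can be sketched on toy data before applying it to the full matrix. This is a minimal illustration, not the notebook's own implementation: the function name `random_guess_accuracy` and the three-category toy labels are assumptions made for the example, and it compares label indices directly instead of building a one-hot prediction matrix.

```python
import numpy as np

def random_guess_accuracy(onehot, rng):
    """Accuracy of uniform random guessing against one-hot labels.

    onehot: (n_examples, n_categories) array with exactly one 1 per row.
    rng:    a numpy.random.Generator used to draw the guesses.
    """
    n_examples, n_categories = onehot.shape
    true_idx = onehot.argmax(axis=1)  # column index of the 1 in each row
    guess_idx = rng.integers(0, n_categories, size=n_examples)
    return float(np.mean(guess_idx == true_idx))

# Toy labels: 6 examples over 3 hypothetical categories
# (these are NOT real Karlgren onsets).
toy = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
])

rng = np.random.default_rng(0)
acc = random_guess_accuracy(toy, rng)
print('Toy accuracy: {:.2f}%'.format(100 * acc))
```

Drawing all guesses in one call avoids growing an array inside a loop, which matters once the number of characters gets large.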

In [5]:
cols = list(matrix.columns)
tone_label_names = [x for x in cols if 'tone_label' in x]
onset_label_names = [x for x in cols if 'Karlgren_onset' in x]
nucleus_label_names = [x for x in cols if 'Karlgren_nucleus' in x]
coda_label_names = [x for x in cols if 'Karlgren_coda' in x]

X = matrix.drop([
    *tone_label_names,
    *onset_label_names,
    *nucleus_label_names,
    *coda_label_names
], axis=1)

In [6]:
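n_examples = matrix.shape[0]
for label_class, name in [
    (tone_label_names, 'Tone'),
    (onset_label_names, 'Onset'),
    (nucleus_label_names, 'Nucleus'),
    (coda_label_names, 'Coda')
]:
    n_categories = len(label_class)
    # Build a one-hot prediction matrix column by column: one column per
    # character, with a single 1 at a uniformly random label index.
    pred_matrix = np.zeros((n_categories, 1))
    pred_matrix[np.random.randint(0, n_categories)] = 1

    # Note: np.c_ copies the whole matrix on every iteration, so this loop
    # is slow for large datasets, but it is adequate for a baseline.
    for row in range(n_examples - 1):
        new_row = np.zeros((n_categories, 1))
        new_row[np.random.randint(0, n_categories)] = 1
        pred_matrix = np.c_[pred_matrix, new_row]

    # A guess is correct where the predicted 1 coincides with the true 1,
    # so the sum of the element-wise product counts the correct guesses.
    acc = np.sum(np.sum(matrix[label_class] * pred_matrix.T)) / n_examples
    print('Accuracy on {}: {:.2f}%'.format(name, 100 * acc))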

Accuracy on Tone: 25.19%
Accuracy on Onset: 2.94%
Accuracy on Nucleus: 1.68%
Accuracy on Coda: 12.41%


As expected, random guessing performs poorly: accuracy ranges from about 25% for tone down to under 2% for nucleus. Next, we try the Zero Rule (0R).
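A short derivation shows what accuracy to expect from uniform guessing. Assume a syllabic element has $K$ possible labels, the true label is $c$ with probability $p_c$, and the guess is drawn uniformly and independently of the truth (the symbols $K$ and $p_c$ are introduced here for illustration). The guess matches any given true label with probability $1/K$, so:

$$
\mathbb{E}[\text{accuracy}] = \sum_{c=1}^{K} p_c \cdot \frac{1}{K} = \frac{1}{K} \sum_{c=1}^{K} p_c = \frac{1}{K}
$$

The result does not depend on how skewed the label distribution is. The observed 25.19% for tone is consistent with $K = 4$; the label counts are not printed in this notebook, so that is an inference from the accuracy rather than a reported figure.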

## Zero Rule

The Zero Rule procedure for classification ignores the input features and always predicts the most frequent class. In our data, the level tone is the most frequent tone, so the Zero Rule predicts the level tone for every character.

In [7]:
# n_examples is defined in the random-guessing cell above.
for label_class, name in [
    (tone_label_names, 'Tone'),
    (onset_label_names, 'Onset'),
    (nucleus_label_names, 'Nucleus'),
    (coda_label_names, 'Coda')
]:
    # Column sums of the one-hot labels are class counts; idxmax picks
    # the name of the most frequent class.
    most_frequent_class = matrix[label_class].sum().idxmax()
    idx = label_class.index(most_frequent_class)
    # Predict that class for every character.
    pred = np.zeros(matrix[label_class].shape)
    pred[:, idx] = 1

    acc = np.sum(np.sum(matrix[label_class] * pred)) / n_examples
    print('Accuracy on {}: {:.2f}%'.format(name, 100 * acc))

Accuracy on Tone: 45.19%
Accuracy on Onset: 8.37%
Accuracy on Nucleus: 8.18%
Accuracy on Coda: 40.14%


By predicting the most frequent class for each syllabic element, we achieve much better results than random guessing: tone accuracy rises from 25.19% to 45.19%, and coda accuracy from 12.41% to 40.14%. Any more complex model must outperform this baseline to justify its added complexity.
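The accuracies above are computed on the full dataset. If later models are scored on a held-out split (an assumption; the notebook does not say how they will be evaluated), the baseline is easier to compare when it is scored the same way: choose the majority class on the training portion only, then measure it on the test portion. A minimal sketch on synthetic labels, where the function name `zero_rule_holdout_accuracy` and its parameters are hypothetical:

```python
import numpy as np

def zero_rule_holdout_accuracy(onehot, test_fraction=0.2, seed=0):
    """Pick the majority class on a training split, score it on a test split.

    onehot: (n_examples, n_categories) array with exactly one 1 per row.
    Returns (majority_class_index, test_accuracy).
    """
    rng = np.random.default_rng(seed)
    n_examples, n_categories = onehot.shape

    # Shuffle the example indices and carve off the test portion.
    order = rng.permutation(n_examples)
    n_test = max(1, int(round(test_fraction * n_examples)))
    test_idx, train_idx = order[:n_test], order[n_test:]

    # Convert one-hot rows to integer labels, then count classes on train only.
    labels = onehot.argmax(axis=1)
    counts = np.bincount(labels[train_idx], minlength=n_categories)
    majority = int(counts.argmax())

    return majority, float(np.mean(labels[test_idx] == majority))

# Synthetic, skewed labels (not the Karlgren data): 70 / 20 / 10 examples.
toy_labels = np.array([0] * 70 + [1] * 20 + [2] * 10)
toy_onehot = np.eye(3, dtype=int)[toy_labels]

majority, acc = zero_rule_holdout_accuracy(toy_onehot)
print('Majority class index:', majority)
print('Held-out accuracy: {:.2f}%'.format(100 * acc))
```

On the real data, the same function could be called as `zero_rule_holdout_accuracy(matrix[onset_label_names].to_numpy())`, and likewise for the other label groups.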