# Reconstruction baseline: multi-class logistic regression

## Baseline model

We begin by trying to **individually** predict the onsets, nuclei and codas for Middle Chinese. Because different reconstruction schemes are not directly comparable, we learn a single reconstruction scheme. For this baseline, we use the Karlgren reconstruction (Karlgren 1957), one of the best-known reconstruction schemes.

Each Chinese character is a single syllable. By convention, each syllable is analyzed as **initial + rime**, where the initial is also called the onset. For example, the English word **moon** can be analyzed as **m + oon**: the **initial** is **m** and the **rime** is **oon**. Because we care about final consonants, we also split the **rime** into its constituent **nucleus** and **coda**, i.e. the vowel and the final consonant. We use `∅` to denote a **zero** consonant, i.e. the *absence* of a consonantal sound.
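
As a minimal sketch of this decomposition, a syllable can be held as three slots, with the rime recovered by joining the nucleus and the coda. The `Syllable` class, its field names, and the `ZERO` constant are illustrative assumptions, not part of the dataset code.

```python
from typing import NamedTuple

ZERO = "∅"  # zero consonant: no consonantal sound in this slot


class Syllable(NamedTuple):
    """Hypothetical container for one syllable (illustration only)."""
    initial: str
    nucleus: str
    coda: str

    @property
    def rime(self) -> str:
        # The rime is nucleus + coda; a zero coda contributes nothing.
        coda = "" if self.coda == ZERO else self.coda
        return self.nucleus + coda


moon = Syllable(initial="m", nucleus="oo", coda="n")
print(moon.initial, "+", moon.rime)  # m + oon

# A made-up open syllable: the coda slot holds the zero consonant.
open_syllable = Syllable(initial="m", nucleus="a", coda=ZERO)
print(open_syllable.rime)  # a
```

Keeping the coda as an explicit slot, even when it is `∅`, means every syllable has the same three fields, which is convenient once each field is one-hot encoded.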

Here is an actual example character from our data, **月** [[Wiktionary](https://en.wiktionary.org/wiki/月)] ('moon'):

```
月 <- {
	mandarin_initial: y
	mandarin_nucleus: ue
	mandarin_coda: ∅
	mandarin_tone: falling
	cantonese_initial: j
	...
	middle_chinese_tone_label: checked
	karlgren_middle_chinese_initial: ŋ
	karlgren_middle_chinese_nucleus: i̯wɐ
	karlgren_middle_chinese_coda: t̚
}
```

We want to predict the Middle Chinese tone, initial, nucleus, and coda. To do this, we first one-hot encode all the features and labels. For clarity, let $L$ be the set of labels (`tone_label_departing`, `Karlgren_coda_ŋ`, etc.). Now let $m$ be the number of examples, $n_X$ the number of features, and $n_y = |L|$ the number of possible labels. We want to predict each label independently from our feature matrix $X$, so we train $n_y$ logistic regression classifiers, one per label.
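
Written out, and assuming the standard logistic (sigmoid) parameterization, each classifier models the probability of its own label given one row $x \in \{0, 1\}^{n_X}$ of $X$. The symbols $w_\ell$ and $b_\ell$ are notation introduced here for the weights and bias of the classifier for label $\ell$:

$$
P(y_\ell = 1 \mid x) = \sigma\left(w_\ell^\top x + b_\ell\right), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \ell \in L.
$$

The classifiers share the feature matrix but no parameters, so nothing forces the predictions for different labels of the same character to be mutually consistent.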

In [188]:
import json
import pickle as pkl
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import average_precision_score, confusion_matrix
from scipy.sparse import csr_matrix

In [121]:
mc = pd.read_csv('preprocessing/dataset/pron/mc-full-table.tsv', sep='\t')
mc.head()

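| | character | index | tone_label | Zhengzhang_onset | Zhengzhang_nucleus | Zhengzhang_coda | Pan_onset | Pan_nucleus | Pan_coda | Shao_onset | ... | Shao_coda | Pulleyblank_onset | Pulleyblank_nucleus | Pulleyblank_coda | Li_onset | Li_nucleus | Li_coda | Karlgren_onset | Karlgren_nucleus | Karlgren_coda |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 一 | 5611001 | checked | ʔ | iɪ | t̚ | ʔ | i | t̚ | ʔ | ... | t̚ | ʔ | i | t̚ | ʔ | iĕ | t̚ | ʔ | i̯ĕ | t̚ |
| 1 | 丁 | 5611004 | level | ʈ | ˠɛ | ŋ | ʈ | ᵚæ | ŋ | ȶ | ... | ŋ | ʈ | əɨj | ŋ | ȶ | ɛ | ŋ | ȶ | æ | ŋ |
| 2 | 丂 | 5611009 | rising | kʰ | ɑu | ∅ | kʰ | ɑu | ∅ | kʰ | ... | ∅ | kʰ | aw | ∅ | kʰ | ɑu | ∅ | kʰ | ɑu | ∅ |
| 3 | 七 | 5611012 | checked | t͡sʰ | iɪ | t̚ | t͡sʰ | i | t̚ | t͡sʰ | ... | t̚ | t͡sʰ | i | t̚ | t͡sʰ | iĕ | t̚ | t͡sʰ | i̯ĕ | t̚ |
| 4 | 丄 | 5611016 | rising | d͡ʑ | ɨɐ | ŋ | d͡ʑ | iɐ | ŋ | d͡ʑ | ... | ŋ | d͡ʑ | ɨa | ŋ | ʑ | ia | ŋ | ʑ | i̯a | ŋ |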


In [122]:
cols = ['character', 'index', 'tone_label', 'Karlgren_onset', 'Karlgren_nucleus', 'Karlgren_coda']
mc_karlgren = mc[cols]
mc_karlgren.head()

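| | character | index | tone_label | Karlgren_onset | Karlgren_nucleus | Karlgren_coda |
|---|---|---|---|---|---|---|
| 0 | 一 | 5611001 | checked | ʔ | i̯ĕ | t̚ |
| 1 | 丁 | 5611004 | level | ȶ | æ | ŋ |
| 2 | 丂 | 5611009 | rising | kʰ | ɑu | ∅ |
| 3 | 七 | 5611012 | checked | t͡sʰ | i̯ĕ | t̚ |
| 4 | 丄 | 5611016 | rising | ʑ | i̯a | ŋ |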


In [168]:
tones  = pd.get_dummies(mc_karlgren.tone_label, prefix='tone_label')
onsets = pd.get_dummies(mc_karlgren.Karlgren_onset, prefix='Karlgren_onset')
nuclei = pd.get_dummies(mc_karlgren.Karlgren_nucleus, prefix='Karlgren_nucleus')
codas  = pd.get_dummies(mc_karlgren.Karlgren_coda, prefix='Karlgren_coda')
karlgren_full = mc_karlgren.drop([
    'tone_label',
    'Karlgren_onset',
    'Karlgren_nucleus',
    'Karlgren_coda'], axis=1).join([
        tones,
        onsets,
        nuclei,
        codas])
karlgren_full.shape

(19453, 113)
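
To make the encoding concrete, here is a small sketch on a three-row frame. The characters and tones are taken from the table head shown earlier, but the frame itself is a toy stand-in for `mc_karlgren`. Only `pd.get_dummies` and its `prefix` argument come from the cell above. Each distinct value becomes its own binary column named `<prefix>_<value>`.

```python
import pandas as pd

# Toy stand-in for mc_karlgren (three rows only, for illustration).
toy = pd.DataFrame({
    "character": ["一", "丁", "七"],
    "tone_label": ["checked", "level", "checked"],
})

# One binary column per distinct tone, named "tone_label_<value>".
tones = pd.get_dummies(toy["tone_label"], prefix="tone_label")

# Swap the categorical column for its one-hot columns.
encoded = toy.drop("tone_label", axis=1).join(tones)

print(list(encoded.columns))
# ['character', 'tone_label_checked', 'tone_label_level']
```

Each row has exactly one active column within a prefix group, which is why the full label matrix is wide but mostly zero.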

Now, we load our features and one-hot encode the phonological units as before.

In [182]:
with open('preprocessing/dataset/processed.pkl', 'rb') as f:
    X = pkl.load(f, encoding='utf-8').drop([
        'Column5', 'Column6', 'Column7', 'Column8',
#         'canto_onset', 'canto_nucl', 'canto_coda', 'canto_tone',
#         'mando_tone', 'mando_onset', 'mando_nucl', 'mando_coda', 
#         'jp_on_first', 'jp_on_rest', 'jp_kan_first', 'jp_kan_rest',
#         'kor_0', 'kor_1', 'kor_2'
    ], axis=1)
print(list(X.columns))
# replace empty strings with the zero symbol '∅'
X = X.replace(to_replace={'': '∅'})

to_encode = X.columns[2:]
for col in to_encode:
    dummies = pd.get_dummies(X[col], prefix=col)
    X = X.drop(col, axis=1).join(dummies)
print('+ Loaded feature matrix X with {} binary phonological features'.format(X.shape[1] - 2))
X.head()

['character', 'wiki_id', 'mando_onset', 'mando_nucl', 'mando_coda', 'mando_tone', 'canto_onset', 'canto_nucl', 'canto_coda', 'canto_tone', 'jp_on_first', 'jp_on_rest', 'jp_kan_first', 'jp_kan_rest', 'kor_0', 'kor_1', 'kor_2']
+ Loaded feature matrix X with 362 binary phonological features


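| | character | wiki_id | mando_onset_b | mando_onset_c | mando_onset_ch | mando_onset_d | mando_onset_f | mando_onset_g | mando_onset_h | mando_onset_j | ... | kor_1_ㅡ | kor_1_ㅢ | kor_1_ㅣ | kor_2_∅ | kor_2_ㄱ | kor_2_ㄴ | kor_2_ㄹ | kor_2_ㅁ | kor_2_ㅂ | kor_2_ㅇ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 犬 | 598 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 馬 | 599 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 西 | 603 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 車 | 604 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 酉 | 650 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |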


In [183]:
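# Note: zeros are counted over every column, and 2 is subtracted from the row
# count (X.shape[0]). Excluding the two ID columns ('character', 'wiki_id')
# would mean subtracting 2 from X.shape[1] instead, so this figure is approximate.
print('{:.1f}% of our matrix is zero.'.format(
    100 * np.sum(X == 0).sum() / (X.shape[0]-2) / X.shape[1]))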

95.4% of our matrix is zero.
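
As a sketch of how sparsity could be measured over the binary feature columns only, one option is to drop the identifier columns before counting zeros. This is an assumption about the intent of the cell above. The toy frame, the feature names `f_a` and `f_b`, and the `id_columns` variable are hypothetical.

```python
import pandas as pd

# Toy stand-in for X: two ID columns followed by binary features.
toy = pd.DataFrame({
    "character": ["犬", "馬"],
    "wiki_id": [598, 599],
    "f_a": [0, 1],
    "f_b": [0, 0],
})

id_columns = ["character", "wiki_id"]  # hypothetical name for the non-feature columns
features = toy.drop(columns=id_columns)

# Fraction of zero entries among the feature cells only.
sparsity = (features == 0).to_numpy().mean()
print("{:.1f}% of the feature entries are zero.".format(100 * sparsity))
# 75.0% of the feature entries are zero.
```

Taking the mean of the boolean mask avoids writing the row and column counts by hand, so the denominator always matches the cells that were actually compared.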


In [184]:
label_columns = karlgren_full.columns[2:]
full_matrix = X.set_index('character').join(
    karlgren_full.set_index('character')).fillna(0).drop(['index', 'wiki_id'], axis=1)

In [185]:
full_matrix = full_matrix.drop('㠛') # (malformed, not very important)

We have around 15K examples, and we will use a 70/30 train/test split.

In [219]:
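# evaluation metrics
# Each helper takes the binary confusion-matrix counts in the order
# produced by confusion_matrix(...).flatten(): tn, fp, fn, tp.
def get_f_beta(tn, fp, fn, tp, beta=1):
    # F-beta: beta > 1 weights recall more heavily, beta < 1 weights precision
    return (1 + beta ** 2) * tp / ((1 + beta ** 2) * tp + beta ** 2 * fn + fp)
def get_precision(tn, fp, fn, tp):
    # fraction of predicted positives that are correct
    return tp / (tp + fp)
def get_recall(tn, fp, fn, tp):
    # fraction of actual positives that are found
    return tp / (tp + fn)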
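
A short worked example may help check these formulas by hand. The counts below are made up for illustration, and the helpers are repeated so that the block runs on its own.

```python
def get_f_beta(tn, fp, fn, tp, beta=1):
    return (1 + beta ** 2) * tp / ((1 + beta ** 2) * tp + beta ** 2 * fn + fp)

def get_precision(tn, fp, fn, tp):
    return tp / (tp + fp)

def get_recall(tn, fp, fn, tp):
    return tp / (tp + fn)

# Illustrative counts in confusion-matrix order: tn, fp, fn, tp.
counts = (90, 5, 10, 15)

precision = get_precision(*counts)   # 15 / (15 + 5)  = 0.75
recall = get_recall(*counts)         # 15 / (15 + 10) = 0.60
f1 = get_f_beta(*counts)             # 30 / (30 + 10 + 5) = 0.667
f2 = get_f_beta(*counts, beta=2)     # 75 / (75 + 40 + 5) = 0.625

print("precision={:.3f} recall={:.3f} F1={:.3f} F2={:.3f}".format(
    precision, recall, f1, f2))
```

With these counts recall is lower than precision, so F2, which weights recall more heavily, comes out below F1. Note that `tn` is accepted by every helper but never used: none of the three metrics depends on true negatives, which matters here because most labels are rare.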

In [220]:
raw_test_accs = []  # per-label test scores (recall here, despite the name)
eval_func = get_recall
# The feature matrix is the same for every label, so build it once.
features = full_matrix.drop(label_columns, axis=1)
for target_column_name in label_columns:
    target_column = full_matrix[target_column_name]
    # Skip labels that never occur in the joined data.
    if not target_column.sum():
        continue
    X_train, X_test, y_train, y_test = train_test_split(
        features, target_column, test_size=.3, random_state=42)

    # One independent binary classifier per label.
    clf = LogisticRegression(solver='liblinear').fit(X_train, y_train)

    raw_train_acc = clf.score(X_train, y_train)
    pred = clf.predict(X_test)

    cm = confusion_matrix(y_test, pred)
    if cm.size == 1:
        # Degenerate 1x1 matrix: only one class appears in y_test and pred.
        eval_score = 0
    else:
        eval_score = eval_func(*cm.flatten())
    # An undefined score (0 / 0 gives NaN) is recorded as 0.
    eval_score = 0 if np.isnan(eval_score) else eval_score
    raw_test_accs.append(eval_score)

In [221]:
print('Mean recall: {:.3f}'.format(np.mean(raw_test_accs)))
print('Median recall: {:.3f}'.format(np.median(raw_test_accs)))

Mean recall: 0.472
Median recall: 0.512
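
These figures are unweighted averages over labels (often called macro averages), and the loop records a score of 0 whenever recall is undefined for a label. The sketch below uses made-up per-label scores, not results from this notebook, to show how that convention affects the mean and median compared with skipping such labels.

```python
import math
from statistics import mean, median

# Hypothetical per-label recalls; NaN marks a label whose recall is undefined.
recalls = [0.8, 0.6, float("nan"), 0.5, float("nan")]

# Convention used in the evaluation loop: undefined counts as 0.
as_zero = [0.0 if math.isnan(r) else r for r in recalls]

# Alternative shown for comparison only: leave undefined labels out.
defined = [r for r in recalls if not math.isnan(r)]

print("NaN as 0:    mean={:.3f} median={:.3f}".format(mean(as_zero), median(as_zero)))
print("NaN skipped: mean={:.3f} median={:.3f}".format(mean(defined), median(defined)))
# NaN as 0:    mean=0.380 median=0.500
# NaN skipped: mean=0.633 median=0.600
```

Counting undefined scores as 0 is the more conservative choice, and it pulls the mean down further than the median when many labels are rare.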
