# An alternate baseline model: A horde of logistic regressions

We want to predict the Middle Chinese tone, initial, nucleus, and coda. To do this, we first one-hot encode all the features and labels. For clarity, let's define the set of labels (`tone_label_departing`, `Karlgren_coda_ŋ`, etc.) as $L$. Now let $m$ be the number of examples we have, $n_X$ the number of features, and $n_y = |L|$ the number of possible labels. We want to predict each label independently from our feature matrix $X$, so we will train $n_y$ logistic regression classifiers, one per label.
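
As a rough illustration of the encoding step, the sketch below one-hot encodes a tiny toy table with `pandas.get_dummies`. The toy rows and the `tone_label` / `Karlgren_coda` source columns are assumptions made up for this example; the notebook itself loads an already-encoded matrix from CSV, so this is not necessarily how that matrix was produced.

```python
import pandas as pd

# Hypothetical toy data (illustrative values only); the real notebook
# reads a precomputed one-hot matrix from disk instead.
toy = pd.DataFrame(
    {
        "character": ["東", "送", "江"],
        "tone_label": ["level", "departing", "level"],
        "Karlgren_coda": ["ŋ", "ŋ", "ŋ"],
    }
).set_index("character")

# One binary column per (field, value) pair, e.g. `tone_label_departing`.
one_hot = pd.get_dummies(toy, prefix_sep="_", dtype=int)

label_set = list(one_hot.columns)  # plays the role of L in the text
n_y = len(label_set)               # number of possible labels

print(one_hot)
```

Each column of `one_hot` then becomes the binary target of its own classifier.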

- **11/29 fix**: use updated (fixed) data matrix

In [12]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import average_precision_score, confusion_matrix

In [13]:
full_matrix = pd.read_csv('model/1129-fixed-data-matrix-karlgren.csv').set_index('character')

In [14]:
label_columns = [x for x in list(full_matrix.columns) if 'Karlgren' in x]

We have around 15K examples, and we will use a 70/30 train/test split.

In [15]:
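# Evaluation metrics. Arguments follow the order of a flattened binary
# confusion matrix from scikit-learn: tn, fp, fn, tp.
def get_f_beta(tn, fp, fn, tp, beta=1):
    # F-beta score; beta > 1 weights recall more heavily than precision.
    return (1 + beta ** 2) * tp / ((1 + beta ** 2) * tp + beta ** 2 * fn + fp)
def get_precision(tn, fp, fn, tp):
    # Fraction of predicted positives that are correct.
    return tp / (tp + fp)
def get_recall(tn, fp, fn, tp):
    # Fraction of actual positives that were found.
    return tp / (tp + fn)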
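
As a sanity check on the helpers above, the following sketch compares them with scikit-learn's own `precision_score`, `recall_score`, and `fbeta_score` on a small made-up label vector. The `y_true` and `y_pred` arrays are invented for illustration; the helper definitions are repeated only so the block runs on its own.

```python
import numpy as np
from sklearn.metrics import (
    confusion_matrix,
    fbeta_score,
    precision_score,
    recall_score,
)

def get_f_beta(tn, fp, fn, tp, beta=1):
    return (1 + beta ** 2) * tp / ((1 + beta ** 2) * tp + beta ** 2 * fn + fp)

def get_precision(tn, fp, fn, tp):
    return tp / (tp + fp)

def get_recall(tn, fp, fn, tp):
    return tp / (tp + fn)

# Made-up binary labels and predictions, for illustration only.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0])

# For binary labels, the flattened matrix is ordered (tn, fp, fn, tp).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

checks = {
    "precision": (get_precision(tn, fp, fn, tp), precision_score(y_true, y_pred)),
    "recall": (get_recall(tn, fp, fn, tp), recall_score(y_true, y_pred)),
    "f2": (get_f_beta(tn, fp, fn, tp, beta=2), fbeta_score(y_true, y_pred, beta=2)),
}
for name, (ours, theirs) in checks.items():
    print(name, ours, theirs)
```

If the two values in each pair agree, the argument order used when unpacking the confusion matrix is the one the helpers expect.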

In [16]:
metrics = []
raw_accs = []
eval_func = get_recall
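# The feature matrix is the same for every label, so build it once.
features = full_matrix.drop(label_columns, axis=1)
for target_column_name in label_columns:
    target_column = full_matrix[target_column_name]
    # Skip labels that never occur; there are no positives to learn from.
    if not target_column.sum():
        continue
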
    X_train, X_test, y_train, y_test = train_test_split(
        features,
        target_column,
        test_size=.3,
        random_state=42)

    clf = LogisticRegression(solver='liblinear').fit(X_train, y_train)
    
    raw_train_acc = clf.score(X_train, y_train)
    pred = clf.predict(X_test)

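    # A flattened 2x2 confusion matrix is (tn, fp, fn, tp). If only one class
    # appears across y_test and pred (typically all zeros), the matrix is 1x1
    # and the metric is undefined, so score that label as 0.
    cm = confusion_matrix(y_test, pred).flatten()
    eval_score = eval_func(*cm) if len(cm) != 1 else 0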
#     eval_score = 0 if eval_score != eval_score else eval_score # handle NaNs
#     print('{:.3f} <- accuracy for {}'.format(eval_score, target_column_name))
    metrics.append(eval_score)
    raw_accs.append(clf.score(X_test, y_test))

In [17]:
print('Raw accuracy: {:.3f}'.format(np.mean(raw_accs)))
print('Mean recall: {:.3f}'.format(np.mean(metrics)))
print('Median recall: {:.3f}'.format(np.median(metrics)))

Raw accuracy: 0.988
Mean recall: 0.470
Median recall: 0.516


## Why is our accuracy so high, yet our recall is so low?

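Our strategy was to fit one logistic regression classifier per one-hot encoded label category. But since there were $>100$ label categories, the column vector for each category was very sparse: most characters are negative examples for any given label. Accuracy rewards a classifier for every negative it gets right, so most of our classifiers had no reason to truly learn; all they needed to do was predict $0$. For instance, a label carried by 1% of the characters gives 99% accuracy to a classifier that always predicts $0$, while its recall is 0. Recall only counts how many of the true positives are recovered, which is why it exposes the problem. For every example, we need a way to force our model to predict exactly one initial (onset), exactly one nucleus, and so on.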
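
One possible way to enforce the "exactly one per group" constraint is to replace the many binary classifiers for a group with a single multiclass classifier. The sketch below shows the idea on synthetic data; the data, the group size, and the choice of scikit-learn's `LogisticRegression` with its default solver (which fits a softmax model for multiclass targets in recent versions) are all assumptions for illustration, not the approach this notebook goes on to use.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 300 examples, 5 features, and one
# label group (say, the coda) with 4 mutually exclusive one-hot columns.
n_examples, n_features, n_classes = 300, 5, 4
X = rng.normal(size=(n_examples, n_features))
true_class = rng.integers(0, n_classes, size=n_examples)
X[np.arange(n_examples), true_class] += 3.0  # make the classes learnable
group_one_hot = np.eye(n_classes, dtype=int)[true_class]

# Collapse the one-hot group into a single class index per example.
y = group_one_hot.argmax(axis=1)

# One multiclass classifier for the whole group instead of 4 binary ones.
clf = LogisticRegression(max_iter=1000).fit(X, y)

# predict() returns exactly one class per example, so re-expanding it
# yields a one-hot row with a single 1.
pred_class = clf.predict(X)
pred_one_hot = np.eye(n_classes, dtype=int)[pred_class]

print(pred_one_hot[:5])
print("ones per row:", np.unique(pred_one_hot.sum(axis=1)))
```

With this setup, a classifier can no longer score well by predicting $0$ everywhere, because every example must be assigned to some class within the group. In practice one such classifier would be trained per group (tone, initial, nucleus, coda).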