# One-versus-rest classification for a more informed model

We have structured our data so that each example (a character) has exactly one onset, one nucleus, one coda, and one tone. It therefore makes sense to try a one-versus-rest strategy on top of logistic regression and see how much performance improves over fitting a separate logistic regression classifier for every one-hot encoded label.

- **11/29**: use updated dataset with ∅s.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

In [2]:
data_matrix = pd.read_csv('model/1129-fixed-data-matrix-karlgren.csv').set_index('character')
assert data_matrix.isnull().sum().sum() == 0 # no NaNs

In [3]:
onset_label_names   = [x for x in data_matrix.columns if 'Karlgren_onset' in x]
nucleus_label_names = [x for x in data_matrix.columns if 'Karlgren_nucleus' in x]
coda_label_names    = [x for x in data_matrix.columns if 'Karlgren_coda' in x]
tone_label_names    = [x for x in data_matrix.columns if 'tone_label' in x]

In [4]:
X = data_matrix.drop([
    *onset_label_names,
    *nucleus_label_names,
    *coda_label_names,
    *tone_label_names
], axis=1)

y_onset   = data_matrix[onset_label_names]
y_nucleus = data_matrix[nucleus_label_names]
y_coda    = data_matrix[coda_label_names]
y_tone    = data_matrix[tone_label_names]

to_pred_ys = {
    'Onset': y_onset,
    'Nucleus': y_nucleus,
    'Coda': y_coda,
    'Tone': y_tone
}

In [5]:
for y_type_name, to_pred_y in to_pred_ys.items():
    X_train, X_test, y_train, y_test = train_test_split(X, to_pred_y, test_size=.3, random_state=42)
    # y_train is a one-hot indicator matrix, so one binary classifier is fit per label column
    clf = OneVsRestClassifier(LogisticRegression(solver='liblinear')).fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    n_examples, _ = y_pred.shape
    y_test_arr = np.array(y_test)
    # The row-wise dot product is 1 when the true label's column is predicted, 0 otherwise
    accuracy = np.mean([y_test_arr[row_idx] @ y_pred[row_idx] for row_idx in range(n_examples)])
    print('Accuracy for {}: {:.2f}%'.format(y_type_name, 100 * accuracy))

Accuracy for Onset: 59.80%
Accuracy for Nucleus: 51.54%
Accuracy for Coda: 91.17%
Accuracy for Tone: 75.54%


Using a one-versus-rest strategy over the same logistic regression classifier as before shows a marked improvement over training a separate logistic regression classifier for each label.
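To make the accuracy figure easier to interpret, here is a minimal sketch of what the row-wise dot product in the `In [5]` loop counts. The arrays `y_true_toy` and `y_pred_toy` are hypothetical toy values, not taken from the dataset. The sketch assumes, as in the loop, that predictions are 0/1 indicator rows that may contain zero or several 1s, and it compares the dot-product score with a stricter exact-match score.

```python
import numpy as np

# Hypothetical toy arrays for illustration only (not from the Karlgren data).
y_true_toy = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
])
y_pred_toy = np.array([
    [1, 0, 0],  # exactly right
    [0, 1, 1],  # true label predicted, plus one extra label
    [0, 0, 0],  # no label predicted at all
    [0, 1, 0],  # wrong label
])

# Same quantity as the loop: 1 for a row when the true label's column is switched on.
hit_rate = np.mean(np.sum(y_true_toy * y_pred_toy, axis=1))

# Stricter alternative: the whole predicted row must equal the true row.
exact_match = np.mean(np.all(y_true_toy == y_pred_toy, axis=1))

print('Hit rate: {:.2f}, exact match: {:.2f}'.format(hit_rate, exact_match))
# → Hit rate: 0.50, exact match: 0.25
```

The second row shows the difference: it counts as a hit under the dot product even though an extra label was predicted, so the dot-product score can only be equal to or higher than the exact-match score.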

In [6]:
from sklearn.neural_network import MLPClassifier

In [9]:
for y_type_name, to_pred_y in to_pred_ys.items():
    X_train, X_test, y_train, y_test = train_test_split(X, to_pred_y, test_size=.3, random_state=42)
    clf = OneVsRestClassifier(MLPClassifier(hidden_layer_sizes=(16,))).fit(X_train, y_train)    
    y_pred = clf.predict(X_test)
    n_examples, _ = y_pred.shape
    y_test_arr = np.array(y_test)
    accuracy = np.mean([y_test_arr[row_idx] @ y_pred[row_idx] for row_idx in range(n_examples)])
    print('Accuracy for {}: {:.2f}%'.format(y_type_name, 100 * accuracy))



Accuracy for Onset: 64.87%
Accuracy for Nucleus: 60.50%
Accuracy for Coda: 90.45%
Accuracy for Tone: 74.23%
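
Since each character has exactly one onset, nucleus, coda, and tone, one possible refinement is to force a single prediction per example by taking the highest-scoring column of `predict_proba` instead of relying on `predict`. The following is only a sketch under stated assumptions: it uses synthetic data from `make_classification` as a stand-in for the real feature matrix, and the names `X_demo`, `Y_demo`, and `top_class` are hypothetical. It has not been run on the Karlgren data, so no accuracy figure is claimed here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in for the real features (assumption, not the Karlgren data).
X_demo, labels = make_classification(
    n_samples=300, n_features=10, n_informative=6, n_classes=3, random_state=0
)
Y_demo = np.eye(3, dtype=int)[labels]  # one-hot rows, same layout as y_onset etc.

X_tr, X_te, Y_tr, Y_te = train_test_split(X_demo, Y_demo, test_size=.3, random_state=42)
clf = OneVsRestClassifier(LogisticRegression(solver='liblinear')).fit(X_tr, Y_tr)

# predict() thresholds each column independently, so a row may hold 0 or several 1s.
labels_per_row = clf.predict(X_te).sum(axis=1)

# Taking the highest-scoring column yields exactly one label per example.
top_class = clf.predict_proba(X_te).argmax(axis=1)
accuracy = np.mean(top_class == Y_te.argmax(axis=1))

print('Rows without exactly one predicted label:', int(np.sum(labels_per_row != 1)))
print('Argmax accuracy: {:.2f}%'.format(100 * accuracy))
```

With this decision rule every test example receives one label, so the resulting accuracy is directly comparable to a standard single-label accuracy.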


