# MLP classifier with word embedding representation

In the previous attempts, the comment text was represented using a bag-of-words model. We attained a respectable score, but the model clearly has trouble generalizing to new data. Toxic words are often made up on the spot, so however big our token-count dictionary is, it cannot cover the test set adequately. Adding n-grams did not improve the model, which is disappointing. My theory is that toxic words are usually concatenated into a single token (e.g. _cock\*\*er_), so the captured n-grams (where n > 1) are just uninformative English phrases.

In this notebook, we instead represent each comment as a dense vector. We use a pre-trained model included with spaCy to vectorize each token, then take the mean of the token vectors as the document vector. Each document vector is fed into a multi-layer perceptron to predict the probabilities for each label.
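
To make the averaging step concrete, here is a minimal sketch with made-up 3-dimensional vectors. The `toy_vectors` table and the `mean_vector` helper are hypothetical and exist only for this illustration; the notebook itself relies on spaCy's pre-trained vectors.

```python
# Toy embedding table: the values are invented for illustration only.
toy_vectors = {
    "you": [0.0, 1.0, 2.0],
    "are": [1.0, 1.0, 0.0],
    "rude": [2.0, 4.0, 1.0],
}


def mean_vector(tokens, table):
    """Average the vectors of the given tokens, dimension by dimension."""
    vectors = [table[token] for token in tokens]
    # zip(*vectors) groups the values of each dimension together.
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


doc_vector = mean_vector(["you", "are", "rude"], toy_vectors)
print(doc_vector)  # [1.0, 2.0, 1.0]
```

Whatever the length of the comment, the result has the same size as a single word vector, which is what lets a fixed-size network consume it.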

In [1]:
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

import spacy
nlp = spacy.load('en_core_web_md')

from helper import load_data

In [2]:
x_train, x_validation, x_test, y_train, y_validation, y_test = load_data()

In [3]:
def comment2vec(text):
    doc = nlp(text)
    # Average only the alphabetic tokens when the comment has more than 10 of them.
    # Short comments, or ones made up mostly of non-alpha tokens, fall back to all tokens.
    if sum(token.is_alpha for token in doc) > 10:
        tokens = [token.vector for token in doc if token.is_alpha]
    else:
        tokens = [token.vector for token in doc]
    return np.mean(np.vstack(tokens), axis=0)

In [4]:
def corpus2matrix(corpus):
    mat = np.zeros((len(corpus), 300), dtype=np.float32)
    for vec, text in zip(mat, corpus):
        vec[:] = comment2vec(text)
    return mat

In [5]:
x_train_vec = corpus2matrix(x_train)

In [6]:
model = MLPClassifier(
    hidden_layer_sizes=(50,),
    random_state=42,
    max_iter=300,
    early_stopping=True,
    n_iter_no_change=25,
    verbose=True
).fit(x_train_vec, y_train)

Iteration 1, loss = 3.58830550
Validation score: 0.614000
Iteration 2, loss = 2.69054374
Validation score: 0.634000
Iteration 3, loss = 2.13806123
Validation score: 0.644000
Iteration 4, loss = 1.75158131
Validation score: 0.652000
Iteration 5, loss = 1.53016163
Validation score: 0.658000
Iteration 6, loss = 1.40353446
Validation score: 0.674000
Iteration 7, loss = 1.32406694
Validation score: 0.680000
Iteration 8, loss = 1.26442988
Validation score: 0.698000
Iteration 9, loss = 1.21652085
Validation score: 0.696000
Iteration 10, loss = 1.17664764
Validation score: 0.698000
Iteration 11, loss = 1.14442680
Validation score: 0.696000
Iteration 12, loss = 1.11692232
Validation score: 0.694000
Iteration 13, loss = 1.09286073
Validation score: 0.704000
Iteration 14, loss = 1.06982721
Validation score: 0.708000
Iteration 15, loss = 1.05056518
Validation score: 0.706000
Iteration 16, loss = 1.03304410
Validation score: 0.708000
Iteration 17, loss = 1.01757602
Validation score: 0.712000
Iterat

## Validation & tuning

Our initial network, with one hidden layer of size 100, reached a training score close to 1.0 while the validation score stayed in the low 0.9s. This is a significant improvement over the previous Naive Bayes model, but the gap between the two scores points to overfitting. Deeper networks also overfitted without improving performance. To combat this, we enabled the `early_stopping` option, which holds out part of the training data and stops training once the validation score stops improving. As a bonus, this also decreases training time.
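
The early-stopping rule can be sketched in a few lines. This is a simplified illustration of patience-based stopping, not scikit-learn's actual implementation; the `stopping_iteration` helper and the score list are made up for the example, and `patience` plays the role of `n_iter_no_change`.

```python
def stopping_iteration(validation_scores, patience, tol=1e-4):
    """Return the 1-based iteration at which training would stop.

    Training stops once the validation score has failed to beat the best
    score so far by at least `tol` for more than `patience` iterations in
    a row. Returns None if the rule never triggers.
    """
    best = float("-inf")
    stale = 0  # consecutive iterations without a meaningful improvement
    for iteration, score in enumerate(validation_scores, start=1):
        if score < best + tol:
            stale += 1
        else:
            stale = 0
        best = max(best, score)
        if stale > patience:
            return iteration
    return None


# Invented scores: improvement stalls after the third iteration.
scores = [0.60, 0.65, 0.70, 0.70, 0.69, 0.70, 0.70]
print(stopping_iteration(scores, patience=2))  # 6
```

A larger patience tolerates longer plateaus before giving up, at the cost of extra training iterations.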

The tuned model uses one hidden layer of size 50, which generalizes well and achieves a good validation score.

In [7]:
x_validation_vec = corpus2matrix(x_validation)

In [8]:
y_train_pred = model.predict_proba(x_train_vec)
y_validation_pred = model.predict_proba(x_validation_vec)

print('Train ROC AUC:', roc_auc_score(y_train, y_train_pred))
print('Validation ROC AUC:', roc_auc_score(y_validation, y_validation_pred))

Train ROC AUC: 0.9861423536460044
Validation ROC AUC: 0.9689210539389465


## Model performance evaluation

In [9]:
x_test_vec = corpus2matrix(x_test)
y_test_pred = model.predict_proba(x_test_vec)
print('Test ROC AUC:', roc_auc_score(y_test, y_test_pred))

Test ROC AUC: 0.9673785646821336


In [10]:
roc_auc_score(y_test, y_test_pred, average=None)

array([0.99058215, 0.95571743, 0.97939484, 0.9554448 , 0.9684104 ,
       0.95472177])
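
As a sanity check, the single test score reported earlier should be the unweighted mean of these per-label scores, since `roc_auc_score` defaults to `average='macro'`. The sketch below recomputes it from the rounded values printed above, so it only agrees up to rounding.

```python
# Per-label test ROC AUC values, copied from the output above.
per_label_auc = [0.99058215, 0.95571743, 0.97939484,
                 0.9554448, 0.9684104, 0.95472177]

# Macro averaging gives every label equal weight, however rare it is.
macro_auc = sum(per_label_auc) / len(per_label_auc)
print(round(macro_auc, 6))  # 0.967379
```

Because every label counts equally, a weak score on a rare label pulls the headline number down as much as a weak score on a common one.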

Using word embeddings and a neural network improved our test score by over 7% compared to the Naive Bayes model. Labels with relatively little training data also perform quite well. In the next notebook we will attempt to retrain the pre-trained word vectors to improve coverage, and try more sophisticated architectures.