# MLP classifier with word embeddings representation

In the previous attempts, the comment text was represented using a bag-of-words model. We were able to attain a respectable score; however, it is clear that the model has trouble generalizing to new data. Words that bring toxicity are often made up on the spot, so however big our token count dictionary is, it can't cover the test set appropriately. N-grams made only marginal improvements to the model, which is disappointing. My theory is that toxic words are usually concatenated into a single token (e.g. _cock\*\*er_), so the captured n-grams (where n > 1) are just uninformative English phrases.
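
The coverage problem can be illustrated with a toy example. The sketch below uses made-up comments (not taken from the dataset) and a naive whitespace tokenizer, both assumptions for illustration, to measure how many test tokens never appear in the training vocabulary.

```python
def tokenize(text):
    # Naive lowercase whitespace tokenizer; an assumption for illustration only.
    return text.lower().split()


def out_of_vocabulary_rate(train_texts, test_texts):
    """Fraction of test tokens that never occur in the training texts."""
    vocabulary = {token for text in train_texts for token in tokenize(text)}
    test_tokens = [token for text in test_texts for token in tokenize(text)]
    if not test_tokens:
        return 0.0
    unseen = sum(token not in vocabulary for token in test_tokens)
    return unseen / len(test_tokens)


# Hypothetical comments; "stupidhead" and "dumbface" stand in for made-up insults.
train_comments = ["you are a stupidhead", "thanks for the helpful edit"]
test_comments = ["you are a dumbface", "thanks for the edit"]

oov_rate = out_of_vocabulary_rate(train_comments, test_comments)
print(f"Out-of-vocabulary rate: {oov_rate:.3f}")
```

In this toy case the only unseen token is the invented insult, which is exactly the token a count-based model would most need to recognize.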

In this notebook, we instead represent each comment as a dense vector. We use a pre-trained model included with spaCy to vectorize each token, then take the mean of the token vectors as the document vector. Each document vector is fed into a multi-layer perceptron to predict the probabilities for each label.
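
As a minimal sketch of the mean-pooling idea, the example below averages hypothetical 3-dimensional token vectors (the real spaCy vectors used in this notebook have 300 dimensions). The `toy_vectors` table and the `mean_pool` helper are made up for illustration. The point is that comments of any length map to a vector of the same size.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, invented for this example.
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "edit": np.array([0.0, 2.0, 0.0]),
    "thanks": np.array([2.0, 1.0, 1.0]),
}


def mean_pool(tokens, vectors):
    """Average the vectors of the given tokens into one document vector."""
    return np.mean(np.vstack([vectors[token] for token in tokens]), axis=0)


short_doc = mean_pool(["good", "edit"], toy_vectors)
long_doc = mean_pool(["thanks", "good", "edit", "good"], toy_vectors)

# Both documents end up with the same dimensionality.
print(short_doc.shape, long_doc.shape)
```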

In [1]:
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

import spacy
nlp = spacy.load('en_core_web_md')

from helper import train_validation_test_data

In [2]:
x_train, x_validation, x_test, y_train, y_validation, y_test = train_validation_test_data()

In [3]:
def comment2vec(text):
    doc = nlp(text)
    # Prefer alphabetic tokens. When a comment has 10 or fewer of them
    # (e.g. it is mostly punctuation, digits, or symbols), use all tokens instead.
    if sum(token.is_alpha for token in doc) > 10:
        tokens = [token.vector for token in doc if token.is_alpha]
    else:
        tokens = [token.vector for token in doc]
    return np.mean(np.vstack(tokens), axis=0)

In [4]:
def corpus2matrix(corpus):
    mat = np.zeros((len(corpus), 300), dtype=np.float32)
    for vec, text in zip(mat, corpus):
        vec[:] = comment2vec(text)
    return mat

In [5]:
x_train_vec = corpus2matrix(x_train)

In [6]:
model = MLPClassifier(
    hidden_layer_sizes=(50,),
    random_state=42,
    max_iter=300,
    early_stopping=True,
    n_iter_no_change=25,
    verbose=True
).fit(x_train_vec, y_train)

Iteration 1, loss = 3.12207724
Validation score: 0.898000
Iteration 2, loss = 1.39896767
Validation score: 0.900000
Iteration 3, loss = 0.89650666
Validation score: 0.900000
Iteration 4, loss = 0.76909411
Validation score: 0.900000
Iteration 5, loss = 0.68172936
Validation score: 0.900000
Iteration 6, loss = 0.61459606
Validation score: 0.898000
Iteration 7, loss = 0.56311488
Validation score: 0.896000
Iteration 8, loss = 0.52075052
Validation score: 0.900000
Iteration 9, loss = 0.49154285
Validation score: 0.906000
Iteration 10, loss = 0.47120626
Validation score: 0.906000
Iteration 11, loss = 0.45510609
Validation score: 0.904000
Iteration 12, loss = 0.44369261
Validation score: 0.904000
Iteration 13, loss = 0.43413240
Validation score: 0.902000
Iteration 14, loss = 0.42555208
Validation score: 0.904000
Iteration 15, loss = 0.41783812
Validation score: 0.906000
Iteration 16, loss = 0.41244502
Validation score: 0.910000
Iteration 17, loss = 0.40927603
Validation score: 0.904000
Iterat

## Validation & tuning

In our initial neural network with one hidden layer of size 100, the training score reached 1.0 while the validation score stayed in the low 0.9s. This is a significant improvement over the previous Naive Bayes model; however, the gap between the two scores indicates overfitting. Deeper networks also overfitted without improving performance. To combat this problem, we enabled the `early_stopping` option, which holds out part of the training data as a validation set and stops training when the validation score stops improving. This also decreases training time, which is a plus.

The tuned model uses one hidden layer of size 50, which generalizes well and achieves a good validation score.
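
The early-stopping rule can be sketched as a simple patience counter. This is a simplified illustration of the idea, not scikit-learn's actual implementation; the function name `stopping_epoch`, the `tol` default, and the score sequence are assumptions made for the example.

```python
def stopping_epoch(validation_scores, n_iter_no_change=25, tol=1e-4):
    """Return the 1-based epoch at which training would stop, or None.

    Training stops once the validation score has failed to beat the best
    score so far by more than `tol` for more than `n_iter_no_change`
    consecutive epochs.
    """
    best_score = float("-inf")
    epochs_without_improvement = 0
    for epoch, score in enumerate(validation_scores, start=1):
        if score > best_score + tol:
            best_score = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement > n_iter_no_change:
            return epoch
    return None


# Hypothetical validation scores, not taken from the training log.
scores = [0.898, 0.900, 0.906, 0.904, 0.905, 0.903, 0.902]

print(stopping_epoch(scores, n_iter_no_change=3))
print(stopping_epoch(scores, n_iter_no_change=25))
```

With a patience of 3 the sketch stops a few epochs after the peak, whereas a patience of 25 (the value used in this notebook) lets the sequence run out without stopping. A larger patience tolerates noisy validation scores at the cost of extra training epochs.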

In [7]:
x_validation_vec = corpus2matrix(x_validation)

In [8]:
y_train_pred = model.predict_proba(x_train_vec)
y_validation_pred = model.predict_proba(x_validation_vec)

print('Train ROC AUC:', roc_auc_score(y_train, y_train_pred))
print('Validation ROC AUC:', roc_auc_score(y_validation, y_validation_pred))

Train ROC AUC: 0.9756677652855248
Validation ROC AUC: 0.9516890736679461
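
For a single label, ROC AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it by brute-force pair counting on made-up labels and scores. It is meant to illustrate the metric rather than replace `roc_auc_score`, and the function name `pairwise_auc` is hypothetical.

```python
def pairwise_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    correct = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                correct += 1.0
            elif pos == neg:
                correct += 0.5  # ties count as half a correct ranking
    return correct / (len(positives) * len(negatives))


# Made-up labels and predicted probabilities for one label.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)
```

Because the metric depends only on the ranking of the scores, it is unaffected by how well calibrated the predicted probabilities are.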


## Model performance evaluation

In [9]:
x_test_vec = corpus2matrix(x_test)
y_test_pred = model.predict_proba(x_test_vec)
print('Test ROC AUC:', roc_auc_score(y_test, y_test_pred))

Test ROC AUC: 0.965614779659225


In [12]:
roc_auc_score(y_test, y_test_pred, average=None)

array([0.94993349, 0.988347  , 0.97138947, 0.99498495, 0.96367297,
       0.92536079])
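
With multi-label targets, `roc_auc_score` defaults to `average='macro'`, the unweighted mean of the per-label scores. A quick check using the per-label values printed above reproduces the overall test score, up to the rounding of the printed digits:

```python
# Per-label test ROC AUC values copied from the output above.
per_label_auc = [0.94993349, 0.988347, 0.97138947, 0.99498495, 0.96367297, 0.92536079]

# Macro average: the unweighted mean over labels.
macro_auc = sum(per_label_auc) / len(per_label_auc)

# Overall test score reported earlier in this notebook.
reported_test_auc = 0.965614779659225

print(f"Macro average: {macro_auc:.6f}")
print(f"Difference from reported score: {abs(macro_auc - reported_test_auc):.1e}")
```

Since every label counts equally in this average, a rare label with a lower score (such as the last one here) pulls the overall number down as much as a common label would.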

Using word embeddings and a neural network improved our test scores by over 20% compared to the Naive Bayes model. Labels with relatively little training data also perform quite well. In the next notebook, we will attempt to retrain the pretrained word vectors to improve coverage, and try more sophisticated architectures.