# Naïve Bayes classifier with BOW representation

In this notebook we obtain a baseline score for toxic comment classification using a simple Naïve Bayes classifier on a bag-of-words (BOW) representation.

Result:
The test ROC AUC score for this model is 0.79 (train: 0.93).

In [1]:
import spacy
nlp = spacy.load('en_core_web_sm')

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score

from helper import train_validation_test_data

In [2]:
# We aren't doing any tuning for the baseline, so we
# ignore the validation set
x_train, _, x_test, y_train, _, y_test = train_validation_test_data()

## Transform corpus to bag-of-words representation

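* Lemmatize tokens and remove stop words and non-alphabetic tokens to reduce the dictionary size
* Use CountVectorizer rather than the faster HashingVectorizer: HashingVectorizer keeps no vocabulary (no mapping from feature index back to token), whereas CountVectorizer learns a dictionary on the training set that is reused to vectorize the validation/test set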
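
The idea of learning a dictionary on the training set and reusing it for later data can be sketched in plain Python. This is only an illustration: the helper names (`simple_tokenize`, `fit_vocabulary`, `to_counts`) and the tiny stop-word set are assumptions made for this example, and the whitespace tokenizer stands in for the spaCy-based one used in the notebook.

```python
from collections import Counter

# Tiny illustrative stop-word set (an assumption; spaCy's list is much larger).
STOP_WORDS = {"a", "an", "the", "is", "are", "you", "this"}

def simple_tokenize(text):
    """Lowercase, split on whitespace, keep alphabetic non-stop-word tokens."""
    tokens = text.lower().split()
    return [t for t in tokens if t.isalpha() and t not in STOP_WORDS]

def fit_vocabulary(corpus):
    """Map each distinct training token to a column index (sorted for determinism)."""
    tokens = {t for doc in corpus for t in simple_tokenize(doc)}
    return {token: i for i, token in enumerate(sorted(tokens))}

def to_counts(text, vocabulary):
    """Return a dense count vector; tokens outside the vocabulary are ignored."""
    counts = Counter(simple_tokenize(text))
    ordered_tokens = sorted(vocabulary, key=vocabulary.get)
    return [counts.get(token, 0) for token in ordered_tokens]

train = ["You are a troll", "This is a nice nice comment"]
vocab = fit_vocabulary(train)
train_vectors = [to_counts(doc, vocab) for doc in train]

# The test document is vectorized with the vocabulary learned on the training set:
# "123" is dropped as non-alphabetic and "unseen" is dropped as out-of-vocabulary.
test_vector = to_counts("nice troll 123 unseen", vocab)

print(vocab)
print(train_vectors)
print(test_vector)
```

Tokens that never occurred in the training set have no column, so they contribute nothing to the test vectors.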

In [3]:
def tokenize(text):
    doc = nlp(text)
    return [token.lemma_ for token in doc if not token.is_stop and token.is_alpha]

vectorizer = CountVectorizer(lowercase=True, tokenizer=tokenize)

In [4]:
train_bow = vectorizer.fit_transform(x_train)
train_bow

<5000x17711 sparse matrix of type '<class 'numpy.int64'>'
	with 110126 stored elements in Compressed Sparse Row format>

In [5]:
classifier = OneVsRestClassifier(MultinomialNB(), n_jobs=-1)
classifier.fit(train_bow, y_train)

OneVsRestClassifier(estimator=MultinomialNB(), n_jobs=-1)
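
For intuition about what is fitted above, here is a minimal pure-Python sketch of a binary multinomial Naïve Bayes model with additive (Laplace) smoothing, wrapped in a one-vs-rest loop that trains one model per label column. It is not scikit-learn's implementation; the function names and the toy data are assumptions made for this example, and it assumes every label column contains both classes.

```python
import math

def fit_multinomial_nb(count_vectors, labels, alpha=1.0):
    """Fit a binary multinomial Naive Bayes model with additive smoothing.

    Assumes both classes (0 and 1) occur at least once in `labels`.
    """
    n_features = len(count_vectors[0])
    model = {}
    for cls in (0, 1):
        rows = [x for x, y in zip(count_vectors, labels) if y == cls]
        feature_totals = [sum(row[j] for row in rows) for j in range(n_features)]
        total = sum(feature_totals)
        model[cls] = {
            "log_prior": math.log(len(rows) / len(labels)),
            "log_likelihood": [
                math.log((count + alpha) / (total + alpha * n_features))
                for count in feature_totals
            ],
        }
    return model

def predict_proba_positive(model, x):
    """Return P(class = 1 | x) by normalizing the per-class joint log-scores."""
    scores = {
        cls: params["log_prior"]
        + sum(c * w for c, w in zip(x, params["log_likelihood"]))
        for cls, params in model.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exp_scores = {cls: math.exp(s - top) for cls, s in scores.items()}
    return exp_scores[1] / (exp_scores[0] + exp_scores[1])

def fit_one_vs_rest(count_vectors, label_matrix):
    """Train one independent binary model per label column."""
    n_labels = len(label_matrix[0])
    return [
        fit_multinomial_nb(count_vectors, [row[k] for row in label_matrix])
        for k in range(n_labels)
    ]

# Toy count vectors (3 vocabulary entries) and 2 hypothetical label columns.
X = [[2, 0, 1], [0, 3, 0], [1, 0, 2], [0, 1, 0]]
Y = [[1, 0], [0, 0], [1, 1], [0, 0]]

models = fit_one_vs_rest(X, Y)
probs = [predict_proba_positive(m, [1, 0, 1]) for m in models]
print(probs)
```

Each label gets its own probability, so one comment can score high on several labels at once, which is what the multi-label evaluation below relies on.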

### Evaluate the classifier on the test set

The metric for the classifier is the mean column-wise ROC AUC
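
As a sketch of what "mean column-wise ROC AUC" means, the snippet below computes the AUC of each label column separately and then takes the unweighted mean. With a multi-label indicator target, `roc_auc_score` with its default `average='macro'` computes the same kind of average. The function names and toy data are assumptions made for this illustration, and the pairwise formulation is chosen for readability rather than speed.

```python
def roc_auc(y_true, y_score):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC is undefined when only one class is present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

def mean_columnwise_roc_auc(y_true, y_score):
    """Average the per-label AUCs, giving every label column equal weight."""
    n_labels = len(y_true[0])
    aucs = [
        roc_auc([row[k] for row in y_true], [row[k] for row in y_score])
        for k in range(n_labels)
    ]
    return sum(aucs) / n_labels

# Toy example: 4 comments, 2 hypothetical labels.
y_true = [[1, 0], [0, 1], [1, 0], [0, 0]]
y_score = [[0.9, 0.2], [0.3, 0.6], [0.4, 0.7], [0.1, 0.1]]

score = mean_columnwise_roc_auc(y_true, y_score)
print(score)
```

In this toy data the first column is ranked perfectly (AUC 1.0) and the second has one of three positive/negative pairs in the wrong order (AUC 2/3), so the mean is 5/6.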

In [6]:
y_train_pred = classifier.predict_proba(train_bow)

test_bow = vectorizer.transform(x_test)
y_test_pred = classifier.predict_proba(test_bow)

In [7]:
print('Train ROC AUC:', roc_auc_score(y_train, y_train_pred))
print('Test ROC AUC:', roc_auc_score(y_test, y_test_pred))

Train ROC AUC: 0.9285338821225851
Test ROC AUC: 0.7871971799805966
