Credit to Jeremy Howard for the guide on setting up this Naive Bayes SVM Model

https://www.kaggle.com/code/jhoward/nb-svm-strong-linear-baseline/notebook

The training and test sets were taken from the Jigsaw Toxic Comment Classification Challenge:

https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge/data

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd, numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [None]:
train = pd.read_csv("/content/drive/MyDrive/train.csv")
test = pd.read_csv('/content/drive/MyDrive/test.csv')

## Looking at the data

The training data contains a row per comment, with an id, the text of the comment, and 6 different labels that we'll try to predict.

In [None]:
train.head()

| | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|---|
| 0 | 0000997932d777bf | Explanation\nWhy the edits made under my usern... | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 000103f0d9cfb60f | D'aww! He matches this background colour I'm s... | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 000113f07ec002fd | Hey man, I'm really not trying to edit war. It... | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0001b41b1c6bb37e | "\nMore\nI can't make any real suggestions on ... | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0001d958c54c6e35 | You, sir, are my hero. Any chance you remember... | 0 | 0 | 0 | 0 | 0 | 0 |


We'll create a list of all the labels to predict, and we'll also create a 'none' label so we can see how many comments have no labels. We can then summarize the dataset.

In [None]:
label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
train['none'] = 1-train[label_cols].max(axis=1)
train.describe()

| | toxic | severe_toxic | obscene | threat | insult | identity_hate | none |
|---|---|---|---|---|---|---|---|
| count | 159571.0 | 159571.0 | 159571.0 | 159571.0 | 159571.0 | 159571.0 | 159571.0 |
| mean | 0.095844 | 0.009996 | 0.052948 | 0.002996 | 0.049364 | 0.008805 | 0.898321 |
| std | 0.294379 | 0.099477 | 0.223931 | 0.05465 | 0.216627 | 0.09342 | 0.302226 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 50% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 75% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [None]:
len(train),len(test)

(159571, 153164)

In [None]:
# Binary target: 1 if the comment carries any of the six labels, else 0
train['y'] = (train[label_cols].sum(axis=1) > 0).astype(int)

There are a few empty comments that we need to get rid of, otherwise sklearn will complain.

In [None]:
COMMENT = 'comment_text'
# Assign the result back; in-place fillna on a selected column is deprecated in recent pandas
train[COMMENT] = train[COMMENT].fillna("unknown")
test[COMMENT] = test[COMMENT].fillna("unknown")

## Building the model

We'll start by creating a *bag of words* representation, as a *term document matrix*. We'll use ngrams, as suggested in the NBSVM paper.

In [None]:
import re, string
re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')
def tokenize(s): return re_tok.sub(r' \1 ', s).split()
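
As a quick check of what this tokenizer produces, the sketch below applies the same pattern to a made-up sentence (the `sample` string is hypothetical). Each punctuation mark is padded with spaces, so it comes out as its own token after the whitespace split.

```python
import re
import string

# Same pattern as above: wrap every punctuation mark in spaces so that
# splitting on whitespace yields punctuation as separate tokens.
re_tok = re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])')

def tokenize(s):
    return re_tok.sub(r' \1 ', s).split()

sample = "Hey man, I'm here!"  # hypothetical input, not from the dataset
tokens = tokenize(sample)
print(tokens)
# ['Hey', 'man', ',', 'I', "'", 'm', 'here', '!']
```

Note that contractions such as "I'm" are split into three tokens, which is one reason the bigrams requested in the vectorizer can help.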

It turns out that using TF-IDF gives even better priors than the binarized features used in the paper. I don't think this has been mentioned in any paper before, but it improves leaderboard score from 0.59 to 0.55.

In [None]:
n = train.shape[0]
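# Boolean flags are passed as True/False; recent scikit-learn versions validate parameter types
vec = TfidfVectorizer(ngram_range=(1,2), tokenizer=tokenize,
               min_df=3, max_df=0.9, strip_accents='unicode', use_idf=True,
               smooth_idf=True, sublinear_tf=True)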
trn_term_doc = vec.fit_transform(train[COMMENT])
test_term_doc = vec.transform(test[COMMENT])

This creates a *sparse matrix* with only a small number of non-zero elements (*stored elements* in the representation below).

In [None]:
trn_term_doc, test_term_doc

(<159571x426005 sparse matrix of type '<class 'numpy.float64'>'
 	with 17775119 stored elements in Compressed Sparse Row format>,
 <153164x426005 sparse matrix of type '<class 'numpy.float64'>'
 	with 14765768 stored elements in Compressed Sparse Row format>)
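
To put "a small number" in perspective, the following sketch computes the density of the training matrix from the shape and stored-element count shown in the output above. The variable names are illustrative only.

```python
# Shape and stored-element count copied from the training matrix repr above.
n_rows, n_cols = 159571, 426005
stored = 17775119

# Fraction of cells that hold a value, and average stored n-grams per comment.
density = stored / (n_rows * n_cols)
avg_per_comment = stored / n_rows

print(f"density: {density:.6f}")
print(f"average stored n-grams per comment: {avg_per_comment:.1f}")
```

Roughly 0.03% of the cells are populated, which is why a dense array of the same shape would be impractical here.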

Here's the basic Naive Bayes feature equation:

In [None]:
def pr(y_i, y):
    p = x[y==y_i].sum(0)
    return (p+1) / ((y==y_i).sum()+1)
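
Read literally, `pr` computes a smoothed per-term average of the feature values over the comments in one class. Written out (with $x_i$ the TF-IDF row of comment $i$ and $N_c$ the number of comments with label $c$; these symbols are notation introduced here, not taken from the original guide):

$$
p_c = \frac{1 + \sum_{i \,:\, y_i = c} x_i}{1 + N_c}, \qquad c \in \{0, 1\}
$$

The model-fitting code then combines the two classes into the log-count ratio

$$
r = \log \frac{p_1}{p_0}
$$

so a term gets a positive weight when it is more typical of class 1 and a negative weight when it is more typical of class 0.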

In [None]:
x = trn_term_doc
test_x = test_term_doc

Fit a logistic regression on the Naive Bayes-scaled features for one dependent variable at a time:

In [None]:
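def get_mdl(y):
    y = y.values
    # Log-count ratio: positive for terms that favor class 1, negative for class 0
    r = np.log(pr(1,y) / pr(0,y))
    m = LogisticRegression(C=4, max_iter=1000)
    # Scale each TF-IDF column by its log-count ratio before fitting
    x_nb = x.multiply(r)
    return m.fit(x_nb, y), r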
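
To make the scaling step concrete, here is a minimal sketch on a hypothetical 4x3 toy matrix. The names `x_toy`, `y_toy`, `pr_toy`, `r_toy`, and `m_toy` are illustrative, and `pr_toy` takes the matrix as an argument instead of reading the global `x`; otherwise it follows the same steps as the functions above.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import LogisticRegression

# Toy term-document matrix: 4 documents x 3 terms (hypothetical values).
x_toy = sparse.csr_matrix(np.array([
    [1.0, 0.0, 1.0],   # class 1
    [1.0, 0.0, 0.0],   # class 1
    [0.0, 1.0, 1.0],   # class 0
    [0.0, 1.0, 0.0],   # class 0
]))
y_toy = np.array([1, 1, 0, 0])

def pr_toy(x, y_i, y):
    # Smoothed per-term feature sum for class y_i, divided by the class size.
    p = x[y == y_i].sum(0)
    return (p + 1) / ((y == y_i).sum() + 1)

# Log-count ratio per term; term 0 favors class 1, term 1 favors class 0,
# term 2 appears equally in both classes.
r_toy = np.asarray(np.log(pr_toy(x_toy, 1, y_toy) / pr_toy(x_toy, 0, y_toy)))
print(np.round(r_toy, 2))  # [[ 1.1 -1.1  0. ]]

# Fit logistic regression on the scaled features, as get_mdl does.
m_toy = LogisticRegression(C=4, max_iter=1000)
m_toy.fit(x_toy.multiply(r_toy).tocsr(), y_toy)

# New documents must be scaled by the same ratio before predicting.
new_docs = sparse.csr_matrix(np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]))
preds_toy = m_toy.predict(new_docs.multiply(r_toy).tocsr())
print(preds_toy)  # [1 0]
```

The uninformative third term gets a ratio of zero, so it drops out of the scaled features entirely.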

In [None]:
preds = np.zeros(len(test))

m,r = get_mdl(train['y'])

In [None]:
preds = m.predict(test_x.multiply(r))

We evaluate the model on the labeled portion of the test set and measure accuracy.

In [None]:
datalabels = pd.read_csv("/content/drive/MyDrive/test_labels.csv")
# Join labels to comments on id so each comment row carries its own labels
# (rows labeled -1 were not scored and are skipped in the loop below)
dataset = test.merge(datalabels, on='id', how='left')
dataset['y'] = (dataset[label_cols].sum(axis=1) > 0).astype(int)
dataset = dataset[['comment_text', 'y']].rename(columns={'comment_text': 'text'})
dataset['text'] = dataset['text'].fillna("unknown")

In [None]:
totalScore = 0
entries = 0
for i, score in enumerate(preds):
  if dataset.iat[i, 0] != 'unknown' and datalabels.iat[i, 1] != -1:
    entries += 1
    if score == dataset.iat[i, 1]:
      totalScore += 1
    if dataset.iat[i, 1] == 1:
      print(dataset.iat[i, 0])


print(totalScore/entries)

0.8680015005158023
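
The same accuracy calculation can be expressed without an explicit loop. The sketch below uses small hypothetical arrays (`preds_toy`, `y_true`, `labeled`) in place of the notebook's data; `labeled` plays the role of the check that excludes unscored rows.

```python
import numpy as np

# Hypothetical stand-ins for the predictions, true labels, and scored-row mask.
preds_toy = np.array([0, 1, 0, 0, 1, 0])
y_true = np.array([0, 1, 0, 0, 0, 0])
labeled = np.array([True, True, True, False, True, True])  # False = unscored row

# Compare only the scored rows, then average the matches.
correct = preds_toy[labeled] == y_true[labeled]
accuracy = correct.mean()
print(accuracy)  # 0.8
```

A boolean mask keeps the filtering and the comparison aligned by position, which is the same assumption the loop makes when it indexes the predictions and labels with one counter.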


In [None]:
m.predict(vec.transform(['You are very nice']).multiply(r))

array([0])

We export the model with pickle to use in other applications.

In [None]:
import pickle

In [None]:
filename = 'NB_SVM_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(m, f)
vectorfile = 'vectorizer.sav'
with open(vectorfile, 'wb') as f:
    pickle.dump(vec, f)
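
Predictions also need the log-count ratio `r`, which the cell above does not save. One option, sketched here under assumptions, is to pickle all three objects together. The helpers `save_bundle` and `load_bundle` and the filename `nb_svm_bundle.pkl` are hypothetical, and the demo uses placeholder values where the notebook would pass `m`, `vec`, and `r`.

```python
import os
import pickle
import tempfile

def save_bundle(path, model, vectorizer, r):
    """Pickle the classifier, vectorizer, and log-count ratio in one file."""
    with open(path, 'wb') as f:
        pickle.dump({'model': model, 'vectorizer': vectorizer, 'r': r}, f)

def load_bundle(path):
    """Load the dictionary written by save_bundle."""
    with open(path, 'rb') as f:
        return pickle.load(f)

# Round-trip demo with placeholders; in the notebook these would be m, vec, and r.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'nb_svm_bundle.pkl')  # hypothetical filename
    save_bundle(path, model='m-placeholder', vectorizer='vec-placeholder', r=[0.5, -0.5])
    bundle = load_bundle(path)

print(sorted(bundle))  # ['model', 'r', 'vectorizer']
```

Because pickle stores functions by reference, the application that loads the vectorizer will likely also need a `tokenize` function importable under the same module path as when it was saved.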