In [1]:
import numpy as np
import pandas as pd
import sklearn
import sklearn.model_selection as ms
import sklearn.feature_extraction.text as text
import sklearn.naive_bayes as nb
import matplotlib.pyplot as plt
# %matplotlib inline

We will work with a dataset of comments, each labeled as "insulting" (1) or not (0).  
We download and read the CSV file from `https://github.com/ipython-books/cookbook-2nd-data/blob/master/troll.csv`:

In [3]:
df = pd.read_csv('https://github.com/ipython-books/'
                 'cookbook-2nd-data/blob/master/'
                 'troll.csv?raw=true')

df[['Insult', 'Comment']].tail()

| | Insult | Comment |
|---|---|---|
| 3942 | 1 | "you are both morons and that is never happening" |
| 3943 | 0 | "Many toolbars include spell check, like Yahoo... |
| 3944 | 0 | "@LambeauOrWrigley\xa0\xa0@K.Moss\xa0\nSioux F... |
| 3945 | 0 | "How about Felix? He is sure turning into one ... |
| 3946 | 0 | "You're all upset, defending this hipster band... |


We tokenize the comments and transform them into a TF-IDF feature matrix `X`:

In [4]:
y = df['Insult']
tf = text.TfidfVectorizer()
X = tf.fit_transform(df['Comment'])
print("X.shape: ", X.shape)

X.shape:  (3947, 16469)


This shows that there are 3947 comments and a vocabulary of 16469 distinct tokens. Let's estimate the sparsity of this feature matrix:

In [5]:
p = 100 * X.nnz / float(X.shape[0] * X.shape[1])
print(f"Each sample has ~{p:.2f}% non-zero features.")

Each sample has ~0.15% non-zero features.
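
To make the shape and sparsity figures more concrete, here is a minimal sketch on a hypothetical three-comment corpus (the sentences and the `toy_` names are made up for illustration and are not part of the dataset). With the default settings, `TfidfVectorizer` lowercases the text and drops single-character tokens such as "I", so each row only stores entries for the few tokens that occur in that comment.

```python
import sklearn.feature_extraction.text as text

# Hypothetical three-comment corpus, used only for illustration.
toy_comments = [
    "you are so stupid",
    "I totally agree with you",
    "many toolbars include spell check",
]

toy_tf = text.TfidfVectorizer()
toy_X = toy_tf.fit_transform(toy_comments)  # SciPy sparse matrix

# One row per comment, one column per distinct token.
print("shape:", toy_X.shape)
print("vocabulary:", sorted(toy_tf.vocabulary_))

# Percentage of stored (non-zero) entries, computed as in the cell above.
toy_p = 100 * toy_X.nnz / (toy_X.shape[0] * toy_X.shape[1])
print(f"{toy_p:.2f}% of the entries are non-zero")
```

On a corpus this small the matrix is fairly dense; the percentage drops quickly as the vocabulary grows, which is why the full matrix is stored in a sparse format.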


Now, we are going to train a classifier as usual. We first split the data into a training set and a test set:

In [6]:
(X_train, X_test, y_train, y_test) = ms.train_test_split(X, y, test_size=.2)
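
The split above is random, so the scores reported in this notebook will vary slightly from run to run. As a sketch of one way to make it reproducible, the example below passes `random_state` and `stratify` to `train_test_split`; the stand-in arrays `X_demo` and `y_demo` are hypothetical and only serve to show the effect.

```python
import numpy as np
import sklearn.model_selection as ms

# Hypothetical stand-in data: 100 samples, 20% positive labels.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 20 + [0] * 80)

# random_state fixes the shuffle; stratify keeps the class ratio
# the same in the training and test subsets.
X_tr, X_te, y_tr, y_te = ms.train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo)

print(X_tr.shape, X_te.shape)
print("positive rate:", y_tr.mean(), y_te.mean())
```

Whether stratification is appropriate for the insult dataset is a judgment call; it is mainly useful when one class is much rarer than the other.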

We use a Bernoulli Naive Bayes classifier (`bnb`) with a grid search on the smoothing parameter α (`alpha`):

In [7]:
bnb = ms.GridSearchCV(
    nb.BernoulliNB(),
    param_grid={'alpha': np.logspace(-2., 2., 50)})
bnb.fit(X_train, y_train)

Let's check the performance of this classifier on the test dataset:

In [8]:
bnb.score(X_test, y_test)

0.8088607594936709
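
The score above is the accuracy of the best estimator found by the grid search, evaluated on the test set. The fitted `GridSearchCV` object also records which value of α was selected and its mean cross-validated accuracy. The sketch below shows how to read these attributes; it uses hypothetical synthetic data (`X_demo`, `y_demo`) and a coarser grid so that it runs quickly on its own.

```python
import numpy as np
import sklearn.model_selection as ms
import sklearn.naive_bayes as nb

# Hypothetical synthetic data: 200 samples, 30 binary features.
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)
X_demo = rng.integers(0, 2, size=(200, 30))
X_demo[:, 0] = y_demo  # make feature 0 fully informative

search = ms.GridSearchCV(
    nb.BernoulliNB(),
    param_grid={'alpha': np.logspace(-2., 2., 5)})
search.fit(X_demo, y_demo)

# The selected smoothing value and its mean cross-validated accuracy.
print("best alpha:", search.best_params_['alpha'])
print("best CV accuracy:", search.best_score_)
```

Applied to the notebook's own object, the same attributes would be `bnb.best_params_` and `bnb.best_score_`.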

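Let's take a look at the words with the largest per-class log probabilities. Note that row 0 of `feature_log_prob_` corresponds to class `0` (non-insulting comments), so the list below is dominated by words that are common in general rather than by insult-specific vocabulary; row 1 holds the values for the insulting class:

In [9]:
# We first get the words corresponding to each feature
names = np.asarray(tf.get_feature_names_out())
# Next, we display the 50 words with the largest log probabilities
# in row 0 of feature_log_prob_ (class 0, non-insulting).
print(','.join(names[np.argsort(
    bnb.best_estimator_.feature_log_prob_[0, :])[::-1][:50]]))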

the,you,to,and,of,is,are,it,that,in,for,on,your,have,not,be,like,they,this,with,all,xa0,he,so,what,if,just,but,up,as,we,was,can,do,will,one,about,or,no,who,out,don,at,from,get,would,an,when,me,by


Finally, let's test our estimator on a few test sentences:

In [10]:
print(bnb.predict(tf.transform([
    "I totally agree with you.",
    "You are so stupid."
])))

[0 1]


Class `0` indicates that the first sentence is predicted to be not insulting, and class `1` indicates that the second sentence is predicted to be insulting.
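
New sentences must go through the same fitted vectorizer before they reach the classifier, which is why `tf.transform` is called in the cell above. As an optional convenience, the two steps can be bundled with scikit-learn's `make_pipeline`, so that the model accepts raw strings directly. The sketch below assumes a hypothetical toy corpus and is not part of the original recipe.

```python
import numpy as np
import sklearn.feature_extraction.text as text
import sklearn.naive_bayes as nb
import sklearn.pipeline as pipeline

# Hypothetical toy corpus; labels: 1 = insulting, 0 = not insulting.
toy_comments = [
    "you are stupid", "you are a moron", "stupid moron",
    "you are right", "I agree with you", "you are kind",
]
toy_y = np.array([1, 1, 1, 0, 0, 0])

# The pipeline fits the vectorizer and the classifier together.
toy_model = pipeline.make_pipeline(
    text.TfidfVectorizer(), nb.BernoulliNB())
toy_model.fit(toy_comments, toy_y)

# Raw strings can now be passed straight to predict.
toy_pred = toy_model.predict([
    "you are a stupid moron",
    "I agree, you are right",
])
print(toy_pred)
```

A pipeline can also be passed to `GridSearchCV` in place of the bare classifier, in which case the vectorizer is refit on each training fold.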

### References

[https://ipython-books.github.io/84-learning-from-text-naive-bayes-for-natural-language-processing/](https://ipython-books.github.io/84-learning-from-text-naive-bayes-for-natural-language-processing/)  
[https://www.kaggle.com/c/detecting-insults-in-social-commentary](https://www.kaggle.com/c/detecting-insults-in-social-commentary)