# Insult Classification

In this exercise, we would like to filter out insulting comments on a web forum. 

To train our models, we have a list of historic comments, each with a judgement of whether it is insulting or not.

In [1]:
import pandas as pd
path_to_insults = 'data/'
data = pd.read_csv(path_to_insults + 'train-utf8.csv')
data.head(10)

Unnamed: 0,Insult,Date,Comment
0,1,20120618192155Z,You fuck your dad.
1,0,20120528192215Z,i really don't understand your point. It seem...
2,0,,A majority of Canadians can and has been wrong...
3,0,,listen if you dont wanna get married to a man ...
4,0,20120619094753Z,Các bạn xuống đường biểu tình 2011 có ôn hoà k...
5,0,20120620171226Z,"@SDL OK, but I would hope they'd sign him to a..."
6,0,20120503012628Z,Yeah and where are you now?
7,1,,shut the fuck up. you and the rest of your fag...
8,1,20120502173553Z,Either you are fake or extremely stupid...mayb...
9,1,20120620160512Z,That you are an idiot who understands neither ...


In [2]:
print("%d comments, of which %d insults (%d%%)" %
      (len(data), data.Insult.sum(), 100 * data.Insult.mean()))

3947 comments, of which 1049 insults (26%)


### Looking for known bad words

One way to do this is to load Google's bad word list and flag comments that contain one or more of its words.

- Load `google_badlist.txt` from `data/insults/`
- Add a column to `data` with a flag (0 or 1) if the comment contains a bad word
- Compute the accuracy of this method - does this look good?
- What would a naive classifier's score be (i.e., always predicting 0 or 1)?
- Also compute the precision, recall, F1 score and AUC score
- What is your verdict?

In [3]:
filename = path_to_insults + 'google_badlist.txt'
filename

'data/google_badlist.txt'

In [4]:
bad_words = pd.read_table(filename, header=None, names=['badWord'])


In [5]:
def isBadWord(doc):
    # Return 1 if any whitespace-separated token of the comment is in the bad-word list.
    tokens = set(doc.split())
    return int(any(word in tokens for word in bad_words.badWord))

In [6]:
data['prediction'] = data.Comment.apply(isBadWord)
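
`isBadWord` matches only exact, case-sensitive, whitespace-separated tokens, so a comment such as `You IDIOT!` is not flagged even if `idiot` is on the list. A minimal sketch of a more forgiving matcher, assuming that lower-casing and dropping punctuation are acceptable here. `contains_bad_word` and the toy word set are hypothetical and are not part of the notebook:

```python
import re

# Hypothetical stand-in for the contents of google_badlist.txt.
TOY_BAD_WORDS = {"idiot", "stupid"}

def contains_bad_word(doc, bad_words=TOY_BAD_WORDS):
    """Return 1 if any lower-cased word token of `doc` is in `bad_words`, else 0."""
    # \w+ keeps letters, digits and underscores, so trailing punctuation is dropped.
    tokens = set(re.findall(r"\w+", str(doc).lower()))
    return int(not tokens.isdisjoint(bad_words))

print(contains_bad_word("You IDIOT!"))          # 1
print(contains_bad_word("Where are you now?"))  # 0
```

Whether this helps on the real comments would have to be checked by re-scoring the `prediction` column.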



In [7]:
accuracy = (float(len(data[data["Insult"] == data["prediction"]])) / len(data))* 100
accuracy


70.81327590575121

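An accuracy of about 70.8% looks low. Always predicting 0 (not an insult) would be right for 2898 of the 3947 comments, about 73.4%, so the bad-word list scores below the naive baseline.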
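
The naive baseline and the precision and recall asked for in the task list can be computed directly from two label columns. A minimal sketch in plain Python, using a small made-up label list rather than the real data. `summarize` is a hypothetical helper name:

```python
def summarize(y_true, y_pred):
    """Accuracy, majority-class baseline, and precision/recall for class 1."""
    n = len(y_true)
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)
    positives = sum(y_true)
    return {
        "accuracy": correct / n,
        # Best score reachable by always predicting the same class.
        "baseline": max(positives, n - positives) / n,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Toy labels (made up for illustration): 1 = insult, 0 = not an insult.
y_true = [1, 0, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0]
result = summarize(y_true, y_pred)
print(result)

# Majority-class baseline from the class counts printed earlier
# (1049 insults out of 3947 comments).
real_baseline = (3947 - 1049) / 3947
print(round(real_baseline, 3))
```

On the real columns, `accuracy_score`, `precision_score`, `recall_score`, `f1_score` and `roc_auc_score` from `sklearn.metrics` give the same kind of numbers without hand-written counting.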

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(data["Comment"])
X_train, X_test, Y_train, Y_test = train_test_split(
    X_train_counts, data.Insult, test_size=0.5, random_state=0)
clf = MultinomialNB()
clf.fit(X_train,Y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [9]:
clf.score(X_test, Y_test) * 100
#accuracy after training

78.01418439716312

In [10]:
# Naive Bayes classifier trained on 50% of the data
from sklearn import metrics
print(metrics.classification_report(Y_test, clf.predict(X_test)))

             precision    recall  f1-score   support

          0       0.81      0.91      0.86      1429
          1       0.65      0.45      0.53       545

avg / total       0.77      0.78      0.77      1974



In [11]:
print(metrics.classification_report(data["Insult"], data["prediction"]))

             precision    recall  f1-score   support

          0       0.77      0.87      0.81      2898
          1       0.42      0.27      0.33      1049

avg / total       0.68      0.71      0.68      3947



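Naive Bayes does better than the bad-word list. For the insult class, precision (0.65 vs. 0.42), recall (0.45 vs. 0.27) and F1 score (0.53 vs. 0.33) are all higher, and accuracy rises from about 71% to 78%.

The comparison is not exact, since Naive Bayes is scored on the held-out half (1974 comments) and the bad-word list on all 3947, but Naive Bayes still looks like the clear winner.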

### Learning bad words on the fly

Another way of doing this is to learn the insulting words on the fly using `CountVectorizer`.
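
To make the feature set concrete, here is a simplified plain-Python sketch of the bag-of-words counting that `CountVectorizer` performs with its default settings (lower-casing, tokens of two or more word characters). It is an illustration only, not scikit-learn's implementation. `bag_of_words` and the two sample comments are made up:

```python
import re
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count_rows) for a list of strings."""
    # Same idea as CountVectorizer's default token pattern: words of 2+ characters.
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

vocab, X = bag_of_words(["You are an idiot", "you are right, you know"])
print(vocab)  # ['an', 'are', 'idiot', 'know', 'right', 'you']
print(X)      # [[1, 1, 1, 0, 0, 1], [0, 1, 0, 1, 1, 2]]
```

Each row is one comment and each column one vocabulary word, which is the shape of the matrix the Naive Bayes models are trained on.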

Please refer to the scikit-learn tutorial at 'http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html' if you need some help.

Here is what you need to do:

- Import `CountVectorizer` from `sklearn.feature_extraction.text`
- Train the `CountVectorizer` on the insults and create a feature set $X$ representing words in the comments
- Train `MultinomialNB` and `BernoulliNB` from `sklearn.naive_bayes` on the new feature set $X$
- Using cross-validation, compute the accuracy, precision, recall, F1 and AUC of your model
- What is your verdict?
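
The steps above can be sketched in one pass with a pipeline and `cross_validate`, which scores several metrics per run. This is one possible arrangement, not the notebook's own: the pipeline refits the vocabulary on each training fold, whereas the cells below fit `CountVectorizer` once on all comments. The toy comments and labels are made up so that the example runs on its own:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy comments and labels, made up for illustration (1 = insult).
texts = [
    "you are an idiot", "what an idiot you are", "you stupid fool",
    "shut up you fool", "stupid idiot", "such a stupid comment you fool",
    "thanks for the link", "I agree with your point", "see you tomorrow",
    "nice article thanks", "where are you now", "I like this point",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]
results = {}
for name, model in [("multinomial", MultinomialNB()), ("bernoulli", BernoulliNB())]:
    # Vectorizer and classifier are fitted together on each training fold.
    pipeline = make_pipeline(CountVectorizer(), model)
    scores = cross_validate(pipeline, texts, labels, cv=3, scoring=scoring)
    results[name] = {m: scores["test_" + m].mean() for m in scoring}
    print(name, results[name])
```

With the real data, `texts` would be `data["Comment"]` and `labels` would be `data["Insult"]`.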

NOTE: The F1 score is another useful score to compute when one of the two classes is rare. We didn't go over it in class, but it is the harmonic mean of precision and recall and ranges from 0 (worst) to 1 (best). You can read more here: 'https://en.wikipedia.org/wiki/F1_score'
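
As a quick check of that definition, the harmonic mean can be computed by hand from the class-1 precision and recall in the two classification reports above. It reproduces the reported F1 values of 0.53 and 0.33. `f1_from` is a hypothetical helper name:

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Class-1 (insult) values as rounded in the reports above.
f1_naive_bayes = round(f1_from(0.65, 0.45), 2)
f1_bad_words = round(f1_from(0.42, 0.27), 2)
print(f1_naive_bayes)  # 0.53
print(f1_bad_words)    # 0.33
```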

In [12]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score



In [13]:
del data['prediction']

In [14]:
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(data["Comment"])
clf_multinomial=MultinomialNB()
clf_multinomial.fit(X_train_counts, data.Insult)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [15]:
from sklearn.naive_bayes import BernoulliNB
clf_binomial=BernoulliNB()
clf_binomial.fit(X_train_counts, data.Insult)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
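
The `binarize=0.0` in this output means that `BernoulliNB` reduces every count above 0 to 1, so it models whether a word occurs rather than how often, whereas `MultinomialNB` works with the counts themselves. A tiny plain-Python illustration of that thresholding on a made-up count row (`binarize` here is a hand-written helper, not the scikit-learn function):

```python
def binarize(row, threshold=0.0):
    """Map each count to 1 if it exceeds the threshold, else 0."""
    return [1 if value > threshold else 0 for value in row]

counts = [0, 1, 0, 3, 2]  # made-up word counts for one comment
binary = binarize(counts)
print(binary)             # [0, 1, 0, 1, 1]
```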

In [16]:
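# Bernoulli NB: cross-validated score for each metric
from sklearn.model_selection import cross_val_score
# Note: this list rebinds the name `metrics`, shadowing the sklearn.metrics module imported earlier.
metrics = ['accuracy', 'precision', 'f1', 'recall', 'roc_auc']
for metric in metrics:
    sc = cross_val_score(BernoulliNB(), X_train_counts, data["Insult"], scoring=metric)
    print(metric, "is with mean=", sc.mean(), "and SD=", sc.std())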


accuracy is with mean= 0.740310924143 and SD= 0.00779666420524
precision is with mean= 0.556391121316 and SD= 0.0814113532582
f1 is with mean= 0.16810091495 and SD= 0.0383787845808
recall is with mean= 0.0991622322281 and SD= 0.0239296795615
roc_auc is with mean= 0.819083259814 and SD= 0.00257648952437


In [17]:
# Multinomial NB: same metrics as above
for metric in metrics:
    sc = cross_val_score(MultinomialNB(), X_train_counts, data["Insult"], scoring=metric)
    print(metric, "is with mean=", sc.mean(), "and SD=", sc.std())

accuracy is with mean= 0.787687080333 and SD= 0.000891868788176
precision is with mean= 0.605393351267 and SD= 0.0052512399194
f1 is with mean= 0.591467682267 and SD= 0.00735937178716
recall is with mean= 0.578621912949 and SD= 0.018260974227
roc_auc is with mean= 0.788669213626 and SD= 0.00671398358448


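# Verdict
Multinomial NB is the better classifier here: accuracy (0.79 vs. 0.74), precision (0.61 vs. 0.56), recall (0.58 vs. 0.10) and F1 (0.59 vs. 0.17) are all higher than for Bernoulli NB. The one exception is ROC AUC, where Bernoulli NB scores higher (0.82 vs. 0.79), which suggests that it ranks comments reasonably well but flags very few insults at the default decision threshold.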
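
ROC AUC depends only on how the scores rank the comments, while recall depends on the decision threshold, so a model can rank well and still flag few insults. A minimal plain-Python sketch on made-up scores (not taken from either model). `auc` and `recall_at` are hypothetical helpers:

```python
def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # Ties between a positive and a negative count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def recall_at(y_true, scores, threshold):
    """Share of true positives whose score reaches the threshold."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    return sum(s >= threshold for s in pos) / len(pos)

# Made-up scores: positives rank above negatives, but mostly below 0.5.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.60, 0.40, 0.35, 0.30, 0.25, 0.20, 0.10, 0.05]

print(auc(y_true, scores))             # 1.0
print(recall_at(y_true, scores, 0.5))  # 0.25
print(recall_at(y_true, scores, 0.3))  # 1.0
```

For the models above, the analogous check would be to threshold the output of `predict_proba` at values other than 0.5 and see how recall and precision move.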
