# Hyper-parameter tuning of a Multinomial Naive-Bayes classifier for a toxic comment detector

Created by John Burt

This notebook shows how to tune parameters for a comment toxicity detector trained on toxicity-scored text data provided for the 2018 Portland Data Science Group NLP workshop meetup. The classifier is a Multinomial Naive-Bayes model, and I use GridSearchCV to test how changes to various parameters affect accuracy.

Setup:
- Load comment and score data files, downloaded from http://dive-into.info/ 
- Combine comment and toxicity score data and generate toxicity categories (toxic vs non-toxic) for classifier training and prediction.

Text pre-processing:
- Clean up text by dropping non-alpha characters.
- Drop words < 3 chars.
- Use snowball stemmer to stem the words.

Classifier method:
- Use TfidfVectorizer to transform text into word count vectors and apply TF-IDF weighting to them.
- MultinomialNB classifier using vectorized text data. 

Hyperparameter tuning:
- Create a pipeline object containing the TfidfVectorizer and MultinomialNB stages.
- Define a set of parameters and values to test.
- Test the parameter values using GridSearchCV, which returns the best set of parameters.
- Validate best model from GridSearchCV using kfolds cross validation.

***
***

## Set up the notebook plot environment, import some basic modules, and load the data.

Notes:

- This block also does a minimal bit of cleanup, by removing the embedded text "NEWLINE_TOKEN" and "TAB_TOKEN".

In [22]:
# remove warnings
import warnings
warnings.filterwarnings('ignore')
# ---

# set matplotlib environment and import some basics
%matplotlib inline
from matplotlib import pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
import numpy as np
import pandas as pd
pd.options.display.max_columns = 100
pd.options.display.max_rows = 100
pd.options.display.max_colwidth = 100 # set to None to see entire text (-1 in older pandas versions)

# ******************************************************
# load the wikipedia toxicity data provided by Matt
# ******************************************************
# set True to load the smaller data set, False to load the large data set
# NOTE: this code assumes data files are in the same folder as the notebook.
if False:
    # comment filename
    commentfile = 'toxicity_annotated_comments_unanimous.tsv'
    # rating filename
    ratingfile = 'toxicity_annotations_unanimous.tsv'

# full data set
else:
    # comment filename
    commentfile = 'toxicity_annotated_comments.tsv'
    # rating filename
    ratingfile = 'toxicity_annotations.tsv'

# load annotated comments
comments = pd.read_table(commentfile)
ratings = pd.read_table(ratingfile)

# replace the embedded NEWLINE_TOKEN text with a real newline and drop TAB_TOKEN
comments['comment'] = comments['comment'].str.replace('NEWLINE_TOKEN','\n')
comments['comment'] = comments['comment'].str.replace('TAB_TOKEN','')

# show shape of each data set
print("comments.shape = ",comments.shape)
print("ratings.shape = ",ratings.shape)


comments.shape =  (159686, 7)
ratings.shape =  (1598289, 4)


In [23]:
import nltk

# uncomment this to download nltk corpus content, if you haven't done this already. 
#  This needs to be done once, and takes a while. 
#  I downloaded everything, but probably the "popular packages" will suffice.
#nltk.download()

## Combine comments and scores into one dataset

Use the pandas groupby function to calculate the mean and median toxicity score for each comment, and add them as columns to the comment dataframe. This aligns each comment with two summary measures of its scores.

Next, I create a new toxicity categorical variable (0=not toxic, 1=toxic) by thresholding the median score at 0. I use the median score here because it is less sensitive to outlier scores than the mean.

Note that I don't use the mean or median scores beyond this point - the Naive Bayes classifier wants a categorical variable. However, you could potentially do some other interesting things with these scores, including implement a different classifier that makes use of score data.
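As a rough illustration of the aggregation and thresholding described above, here is a minimal sketch on made-up data. The toy values and the names `per_comment` and `scored` are assumptions for illustration only; the column names `rev_id` and `toxicity_score` follow the real data files.

```python
import pandas as pd

# Toy stand-ins for the real tables (all values here are made up).
comments = pd.DataFrame({"rev_id": [30, 10, 20],
                         "comment": ["third", "first", "second"]})
ratings = pd.DataFrame({"rev_id": [10, 10, 10, 20, 20, 20, 30, 30, 30],
                        "toxicity_score": [-1, -2, 0, 0, 1, 1, -1, 0, 2]})

# One row per comment, with the mean and median of its annotator scores.
per_comment = ratings.groupby("rev_id")["toxicity_score"].agg(["mean", "median"])

# Join on rev_id so scores stay attached to the right comment
# regardless of row order in either table.
scored = comments.merge(per_comment, left_on="rev_id", right_index=True)

# Toxic (1) when the median score is below 0, otherwise not toxic (0).
scored["toxicity"] = (scored["median"] < 0).astype(int)
print(scored)
```

Joining on the comment ID rather than on row position is a design choice in this sketch: it does not rely on both tables being sorted the same way.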

In [24]:
scoredcomments = comments.copy()
# group all scores by comment ID, then add mean and median score columns to the comment data.
# Mapping on rev_id keeps each score attached to its own comment even if the
#   two tables are not in the same row order.
score_groups = ratings.groupby("rev_id")["toxicity_score"]
scoredcomments["mean_score"] = scoredcomments["rev_id"].map(score_groups.mean())
scoredcomments["median_score"] = scoredcomments["rev_id"].map(score_groups.median())

# create categorical variable toxicity: if median score < 0, toxicity=1, otherwise 0
scoredcomments["toxicity"] = (scoredcomments["median_score"] < 0).astype(int)

# make the comment IDs ints
scoredcomments.rev_id = np.int64(scoredcomments.rev_id)

print("scoredcomments.shape = ",scoredcomments.shape)
scoredcomments.head()

scoredcomments.shape =  (159686, 10)


| | rev_id | comment | year | logged_in | ns | sample | split | mean_score | median_score | toxicity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2232 | This:\n:One can make an analogy in mathematical terms by envisioning the distribution of opinion... | 2002 | True | article | random | train | 0.4 | 0.5 | 0 |
| 1 | 4216 | `\n\n:Clarification for you (and Zundark's right, i should have checked the Wikipedia bugs page... | 2002 | True | user | random | train | 0.5 | 0.0 | 0 |
| 2 | 8953 | Elected or Electoral? JHK | 2002 | False | article | random | test | 0.1 | 0.0 | 0 |
| 3 | 26547 | `This is such a fun entry. Devotchka\n\nI once had a coworker from Korea and not only couldn't... | 2002 | True | article | random | train | 0.6 | 0.0 | 0 |
| 4 | 28959 | Please relate the ozone hole to increases in cancer, and provide figures. Otherwise, this articl... | 2002 | True | article | random | test | 0.2 | 0.0 | 0 |




## Clean up the text

- Remove non alpha chars (numbers, etc)
- Drop words less than 3 chars
- Stem the words with snowball stemmer
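
To make the first two steps concrete, here is a minimal sketch that uses only the standard library. The function name `clean_text` and its `min_word_size` argument are hypothetical, and it splits on whitespace rather than using NLTK's `word_tokenize`. Stemming is left out so the sketch runs without NLTK; in the notebook that step is handled by the snowball stemmer.

```python
import re

# Match anything that is not an ASCII letter or whitespace.
non_alpha = re.compile(r"[^a-zA-Z\s]")

def clean_text(text, min_word_size=3):
    """Lower-case, strip non-alpha characters, and drop short words."""
    letters_only = non_alpha.sub("", text.lower())
    words = [w for w in letters_only.split() if len(w) >= min_word_size]
    return " ".join(words)

print(clean_text("Elected or Electoral? JHK, 2002!"))
print(clean_text("a well-known bug"))
```

Note that deleting characters, rather than replacing them with a space, fuses hyphenated words and contractions ("well-known" becomes "wellknown"). Whether that is desirable is a judgment call worth testing.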



In [25]:
%%time
import re
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords as sw

#stemmer = PorterStemmer() # alternate stemmer
stemmer = SnowballStemmer(language='english')

# set up regex expression to remove all but alpha chars and whitespace
#  (raw string, so that \s is not treated as an invalid string escape)
regex = re.compile(r'[^a-zA-Z\s]') 

numsamples = scoredcomments.comment.shape[0]

# set minimum word size. Words with fewer characters are dropped. 
#  I do this because there are a lot of 2 char initials in the comment data, which I think aren't useful,
#   ... or maybe they are - adjust this and see what happens!
minwordsize = 3

print("Processing %d samples:"%(numsamples))

# transform each sample text:
stemmed_text = []
for i, text in enumerate(scoredcomments.comment):
    # set to lower case and strip everything except letters and whitespace
    text = regex.sub('',text.lower())
    # look at each word in text
    t = []
    for word in word_tokenize(text):
        # drop "words" that are too short or too long (otherwise stem crashes!)
        if len(word) >= minwordsize and len(word) < 30: 
            t.append(stemmer.stem(word)) # stem the added word
    stemmed_text.append(" ".join(t)) # re-combine list of stemmed words
    # print progress every 5000 samples
    if not i%5000: print(i,',', end="")

stemmed_text = pd.Series(np.array(stemmed_text)) # convert list of sample texts to pandas series
    
print("\n\nstemmed_text:\n",stemmed_text[:3])


Processing 159686 samples:
0 ,5000 ,10000 ,15000 ,20000 ,25000 ,30000 ,35000 ,40000 ,45000 ,50000 ,55000 ,60000 ,65000 ,70000 ,75000 ,80000 ,85000 ,90000 ,95000 ,100000 ,105000 ,110000 ,115000 ,120000 ,125000 ,130000 ,135000 ,140000 ,145000 ,150000 ,155000 ,

stemmed_text:
 0    this one can make analog mathemat term envis the distribut opinion popul gaussian curv would the...
1    clarif for you and zundark right should have check the wikipedia bug page first this bug the cod...
2                                                                                      elect elector jhk
dtype: object
Wall time: 2min 45s


## Equalize the number of samples of toxic and non-toxic comments

If you train the multiNB classifier on all of the comment data, you will notice that it is very good at classifying non-toxic comments, but very poor at classifying toxic comments. That's because about 88% of the comments are non-toxic (141,320 of 159,686), so the classifier learns a bias toward labelling most comments as non-toxic, because this gives the highest overall accuracy.

The accuracy at classifying toxic comments dramatically improves when you equalize the number of non-toxic and toxic comments passed to the classifier for training, with only a small drop in accuracy for non-toxic comments. The code in this section keeps every toxic comment and the first equal-sized block of non-toxic comments.
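
The sketch below illustrates both points on made-up labels: why raw accuracy is misleading on imbalanced data, and one way to equalize the classes. It draws the non-toxic samples at random, which is a variation on the notebook's approach of taking the first block of non-toxic comments. All names and values here are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Toy labels with a similar kind of imbalance: 90 non-toxic, 10 toxic.
y = np.array([0] * 90 + [1] * 10)
X = np.array(["comment %d" % i for i in range(y.size)])

# A "classifier" that always answers non-toxic already scores 90% accuracy.
baseline = (y == 0).mean()
print("always-non-toxic accuracy:", baseline)

toxic_idx = np.flatnonzero(y == 1)
nontoxic_idx = np.flatnonzero(y == 0)

# Randomly keep as many non-toxic samples as there are toxic ones.
kept_nontoxic = rng.choice(nontoxic_idx, size=toxic_idx.size, replace=False)
balanced_idx = np.concatenate([kept_nontoxic, toxic_idx])

X_bal, y_bal = X[balanced_idx], y[balanced_idx]
print("class counts after balancing:", np.bincount(y_bal))
```

Random sampling avoids depending on how the source file happens to be ordered, at the cost of a result that changes with the seed.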

In [26]:
# number of samples to generate for each text category = # toxic comments
numtrainingsamples = np.sum(scoredcomments.toxicity==1)

#text = np.array(scoredcomments.comment) # use this to work with un-modified comment data
text = np.array(stemmed_text) # use this to work with the cleaned and stemmed comment data

# split the data by category.
ind, = np.where(scoredcomments.toxicity==0)
X_nontoxic = text[ind]
target_nontoxic = scoredcomments.toxicity.values[ind]
ind, = np.where(scoredcomments.toxicity==1)
X_toxic = text[ind]
target_toxic = scoredcomments.toxicity.values[ind]

print("original data:")
print("#nontoxic = ",X_nontoxic.size)
print("#toxic = ",X_toxic.size)

# recombine the data with equalized number of samples of each category
X_text = np.concatenate( (X_nontoxic[:numtrainingsamples],X_toxic), axis=0) 
target = np.concatenate( (target_nontoxic[:numtrainingsamples],target_toxic), axis=0) 


original data:
#nontoxic =  141320
#toxic =  18366


## Set up the gridsearch 

The gridsearch using GridSearchCV is pretty straightforward: it takes a single estimator object, a set of parameters to test and some train/test data, and then exhaustively trains and tests the estimator with every parameter value combination to determine the one that gives the best score. [See here for more about hyperparameter tuning and gridsearch.](http://scikit-learn.org/stable/modules/grid_search.html#grid-search)

What is an estimator object? That is just an object that implements the scikit-learn estimator interface, which includes all scikit-learn classifiers and regressors, as well as the various data transformation objects (e.g., Normalizer, StandardScaler, etc). The classifier I use, MultinomialNB, is an estimator object and I could just pass that into gridsearch if I only wanted to tune the parameters available for MultinomialNB (which aren't many). But what if I also want to tune parameters for the data transformation steps before the classifier? For example, TfidfVectorizer has a lot of interesting parameters to tweak - how do I use GridSearchCV to find optimal parameters for both TfidfVectorizer and MultinomialNB? Enter the Pipeline object!

The [Pipeline object](http://scikit-learn.org/stable/modules/pipeline.html#pipeline) is an estimator object that lets you chain multiple estimators so that you can transform data and train a classifier in one step. You can chain as many estimators as you want and even use your own custom estimator objects. In this case, I'll pipeline the two estimators I've been using to classify the NLP data: TfidfVectorizer and MultinomialNB. Once I create the pipeline object containing these two estimators, I can then pass that to GridSearchCV and tune both at the same time.
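
Here is a minimal sketch of a two-step pipeline on a made-up four-document corpus. The documents, labels, and the name `toy_pipeline` are assumptions for illustration; the step names `tfidf` and `clf` match the ones used later in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy data: 1 = toxic, 0 = non-toxic.
docs = ["you are an idiot", "thanks for the helpful edit",
        "what an idiot", "thanks, nice edit"]
labels = [1, 0, 1, 0]

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])

# fit() fits and applies the vectorizer, then fits the classifier on the result.
toy_pipeline.fit(docs, labels)

# predict() pushes new text through the same steps.
print(toy_pipeline.predict(["idiot", "helpful edit"]))

# Parameters of each step are exposed as <step name>__<parameter name>.
print(toy_pipeline.get_params()["tfidf__ngram_range"])
```

Because the pipeline behaves like a single estimator, anything that accepts an estimator (including GridSearchCV) can treat the vectorizer and classifier as one unit.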

Method:

- The general procedure is to run the gridsearch multiple times with different parameters and values to try and find the optimal parameter settings. You usually can't test all parameters and values in the same run (it could take a very long time to do everything at once), so you will be trying to test a few parameters at a time. 


- Define my own defaults. This step isn't necessary, but it's useful for setting parameters not being tuned to specific values that might not be the built-in defaults. For example I might find that removal of stop words is always better in TfidfVectorizer (not removing stop words is the normal default), in which case I set the stop_words param to "english".


- Create the Pipeline object containing the estimator objects I want to test (TfidfVectorizer and MultinomialNB).


- Create the parameter set to test. In this case, all of the parameters I want to test belong to TfidfVectorizer, since MultinomialNB has very few. In this example, I provide all of the relevant parameters for TfidfVectorizer, but leave most of them commented out. This way I can try grid searches with different combos of parameters by simply leaving the ones I want uncommented. Try out some different parameters yourself ([see the documentation for TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) for parameter details). Note the naming convention for gridsearch parameters: "pipeline step name"\_\_"parameter name", where the step name is the label given to the estimator when the pipeline is created (for example, tfidf\_\_min\_df).


- Create the GridSearchCV object, passing it the pipeline and the parameter set. The train/test data are passed later, when calling its fit method. Note the parameter "n_jobs=-1": this tells GridSearchCV to run the fits in parallel, using all available processor cores on your computer (by default, it runs one job at a time). Beware that your computer may get very busy while it is running gridsearch!
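
To see how quickly the search grows, the sketch below counts the candidate combinations for the two parameters tested in this notebook, using scikit-learn's `ParameterGrid`. The variable names and the fold count of 3 are assumptions for illustration; the number of folds GridSearchCV uses by default depends on the scikit-learn version.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "tfidf__stop_words": ("english", None),
    "tfidf__ngram_range": ((1, 1), (1, 2), (2, 2), (1, 3), (3, 3)),
}

# Every combination of one value per parameter is a candidate.
candidates = list(ParameterGrid(param_grid))
n_folds = 3  # assumed number of cross-validation folds

print("candidates:", len(candidates))            # 2 * 5 combinations
print("total fits:", len(candidates) * n_folds)  # each candidate is fit once per fold
print("first candidate:", candidates[0])
```

Adding a third parameter with four values would multiply the number of fits by four, which is why tuning a few parameters at a time is usually more practical.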

In [27]:
from time import time
import logging

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

# Tfidf vectorizer:
# define defaults: doing it this way allows us to define our own default params
tfidfargs = {
    "analyzer":'word', 
    "max_features" : None,
    "max_df" : 0.25, # Filters out terms that occur in more than 25% of the docs
    "min_df" : 2, # Filters out terms that occur in only one document (min_df=2).
    "ngram_range":(1, 3), # unigrams, bigrams and trigrams
    "stop_words" : "english", # None, # "english", # Strips out “stop words”
    "use_idf" : True
    }

# Define a pipeline combining a text vectorizer with a Naive Bayes classifier
pipeline = Pipeline([    
    ('tfidf', TfidfVectorizer(**tfidfargs)),
    ('clf', MultinomialNB()),
])

# Define the parameters and values we want to test.
# Uncommenting more parameters will give better exploring power but will
#   increase processing time in a combinatorial way. I suggest tuning <= 3
#   parameters at a time.
# Note the naming format: pipelineobjectname__paramname
parameters = {
    'tfidf__stop_words': ('english', None),
    #'tfidf__analyzer': ('word', 'char_wb'),
    #'tfidf__analyzer': ('word', 'char', 'char_wb'),
    #'tfidf__max_df': (0.1, 0.25, 0.5, 0.75),
    #'tfidf__min_df': (1,2,5),
    #'tfidf__max_features': (None, 5000, 10000, 50000),
    'tfidf__ngram_range': ((1, 1), (1, 2), (2, 2), (1, 3), (3, 3)),  # (min n, max n): unigrams only up to trigrams only
    #'tfidf__use_idf': (True, False),
}

# create grid search object to find the best parameters for both the 
#   feature extraction and the classifier.
# Note: n_jobs=-1 causes GridSearchCV to run fits in parallel on all processor cores.
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)

## Perform the grid search

The gridsearch takes the pipeline object (containing the text vectorizer and the multiNB classifier) and the data and tries all combos of the parameters we have defined. After fitting, the attribute "best\_estimator\_" holds a pipeline object, refit on all of the data, with the parameters that gave the best cross-validated score, and "best\_score\_" holds that score.
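
Beyond the single best result, the fitted search object also records the score of every candidate in its `cv_results_` attribute. The sketch below shows one way to view it as a table, on a made-up eight-document corpus. The data, the tiny parameter grid, `cv=2`, and the names `toy_search` and `results` are assumptions for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy data: 1 = toxic, 0 = non-toxic.
docs = ["you idiot", "stupid idiot", "total idiot", "shut up stupid",
        "nice edit", "thanks for the edit", "good work", "thanks, good point"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_search = GridSearchCV(
    Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())]),
    {"tfidf__use_idf": (True, False), "clf__alpha": (0.1, 1.0)},
    cv=2,  # two folds, because the toy data set is tiny
)
toy_search.fit(docs, labels)

# One row per candidate, with its mean cross-validated score and rank.
results = pd.DataFrame(toy_search.cv_results_)
print(results[["params", "mean_test_score", "rank_test_score"]])
print(toy_search.best_params_)
```

Looking at the full table helps show whether the best candidate wins by a clear margin or whether several settings perform about the same.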

In [28]:
print("Performing grid search...")
print("pipeline:", [name for name, _ in pipeline.steps])
print("parameters:")
print(parameters)
t0 = time()
grid_search.fit(X_text, target)
print("done in %0.3fs" % (time() - t0))
print()

print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

Performing grid search...
pipeline: ['tfidf', 'clf']
parameters:
{'tfidf__stop_words': ('english', None), 'tfidf__ngram_range': ((1, 1), (1, 2), (2, 2), (1, 3), (3, 3))}
Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:  1.1min finished


done in 68.492s

Best score: 0.865
Best parameters set:
	tfidf__ngram_range: (1, 1)
	tfidf__stop_words: None


## Test the best classifier using k-folds cross validation



cross_validate_classifier uses k-folds cross validation to partition the data into multiple non-overlapping test sets, each paired with the remaining data for training, and runs the classifier on each. If the classifier is solid, the results should be similar across all sets. If the classifier has problems, for example it is overfitting, then you will see more variation.

Normally, you would spend a lot of time on the kind of training and parameter tweaking shown above, and then periodically use cross-validation as a "reality check" to verify that your model is robust.
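
The sketch below shows what the stratified k-fold splitter used in this section does, on made-up labels: each test fold keeps the class ratio of the full data set, and every sample is tested exactly once. The toy data and the fixed `random_state` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels: 20 non-toxic (0) and 10 toxic (1).
y = np.array([0] * 20 + [1] * 10)
X = np.arange(y.size).reshape(-1, 1)  # placeholder features

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

seen = []
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    # Each test fold keeps the 2:1 class ratio of the full data set.
    print("fold", fold, "test class counts:", np.bincount(y[test_idx]))
    seen.extend(test_idx.tolist())

# Every sample lands in exactly one test fold.
print(sorted(seen) == list(range(y.size)))
```

Fixing `random_state` makes the folds repeatable between runs, which is convenient when comparing models. Leaving it unset gives a different shuffle each time.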

In [29]:
from sklearn.model_selection import KFold, StratifiedKFold

# helper function to report accuracy results of a prediction run
def print_prediction_results(y_est, y_target):
    
    print("Classifier results:")
    
    print("\ttest set: #non-toxic = %d = %2.0f%%,  #toxic = %d = %2.0f%%"%(
        y_est[y_target==0].size, 100*y_est[y_target==0].size/y_est.size,
        y_est[y_target==1].size, 100*y_est[y_target==1].size/y_est.size) )          

    print("\taccuracy all =    \t%d/%d = %2.1f%%"%(
        (y_est == y_target).sum(), 
        y_est.size,
        100*(y_est == y_target).sum() / y_est.size))

    print("\taccuracy non-toxic = \t%d/%d = %2.1f%%"%(
        (y_est[y_target==0] == 0).sum(),
        y_est[y_target==0].size,
        100*(y_est[y_target==0] == 0).sum() / y_est[y_target==0].size))

    print("\taccuracy toxic = \t%d/%d = %2.1f%%"%(
        (y_est[y_target==1] == 1).sum(), 
        y_est[y_target==1].size,
        100*(y_est[y_target==1] == 1).sum() / y_est[y_target==1].size))
    

# cross-validation of classifier model with text string data X_text, category labels in y
def cross_validate_classifier(clf, X_text, y):

    # set up kfold to generate several train-test sets, 
    #  with shuffled indices for selecting from data
    kf = StratifiedKFold(n_splits=5, shuffle=True)

    i = 1
    accuracy = []
    for train_index, test_index in kf.split(X_text, y):
        print("\nk-fold train/test set #%d: "%(i))

        # fit the classifier with training data
        clf.fit(X_text[train_index], y[train_index])

        # generate predictions for test data
        y_est = clf.predict(X_text[test_index])

        # print results of the prediction test
        print_prediction_results(y_est, y[test_index])

        accuracy.append((y_est == y[test_index]).sum() / y_est.size)
        i += 1

    print("\nOverall accuracy = %2.1f%%"%(np.mean(accuracy)*100))

In [30]:

# cross-validate MultinomialNB classifier using nonoverlapping subsets of data
print("\n***************************")
print("Cross-validate classifier:")
cross_validate_classifier(grid_search.best_estimator_, X_text, target)


***************************
Cross-validate classifier:

k-fold train/test set #1: 
Classifier results:
	test set: #non-toxic = 3674 = 50%,  #toxic = 3674 = 50%
	accuracy all =    	6445/7348 = 87.7%
	accuracy non-toxic = 	3282/3674 = 89.3%
	accuracy toxic = 	3163/3674 = 86.1%

k-fold train/test set #2: 
Classifier results:
	test set: #non-toxic = 3673 = 50%,  #toxic = 3673 = 50%
	accuracy all =    	6467/7346 = 88.0%
	accuracy non-toxic = 	3308/3673 = 90.1%
	accuracy toxic = 	3159/3673 = 86.0%

k-fold train/test set #3: 
Classifier results:
	test set: #non-toxic = 3673 = 50%,  #toxic = 3673 = 50%
	accuracy all =    	6446/7346 = 87.7%
	accuracy non-toxic = 	3279/3673 = 89.3%
	accuracy toxic = 	3167/3673 = 86.2%

k-fold train/test set #4: 
Classifier results:
	test set: #non-toxic = 3673 = 50%,  #toxic = 3673 = 50%
	accuracy all =    	6420/7346 = 87.4%
	accuracy non-toxic = 	3292/3673 = 89.6%
	accuracy toxic = 	3128/3673 = 85.2%

k-fold train/test set #5: 
Classifier results:
	test set: #