

# NLP pre-processing and classification using Multinomial Naive-Bayes 

This notebook is an example of how to load, pre-process and classify the toxicity-scored text comment data provided for the 2018 Portland Data Science Group NLP workshop. The code here uses a Multinomial Naive-Bayes classifier, which is commonly used for NLP classification tasks (though it is by no means the only one!). Additionally, I explore the effect of adjusting the number of training samples of each class (toxic vs non-toxic) on classifier accuracy.

Setup:
- Load data downloaded from http://dive-into.info/ - The data is expected to be in the same folder as this notebook.
- Download and install nltk content (optional)
- Combine comment and toxicity score data and generate toxicity categories (toxic vs non-toxic) for classifier training and prediction.

Text pre-processing:
- Clean up text by dropping non-alpha characters.
- Drop words < 3 chars.
- Use a word stemmer to stem the words.

Classifier training:
- Use TfidfVectorizer to create word count vectors and weight them with the TF-IDF algorithm.
- Fit (train) MultinomialNB classifier with vectorized word data.
- Test classifier.
- Run k-fold cross-validation to test robustness of the classifier.
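
To make the training steps above concrete, here is a minimal sketch on a made-up four-comment corpus. The texts and labels are hypothetical and are not taken from the workshop data; only `TfidfVectorizer` and `MultinomialNB` are the same tools used later in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus (not from the workshop data): 1 = toxic, 0 = non-toxic
docs = ["you are an idiot",
        "thank you for the helpful edit",
        "what an idiot edit",
        "thank you for the source"]
labels = [1, 0, 1, 0]

# Build the vocabulary and the TF-IDF weighted document vectors in one step
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per document, one column per term

# Fit the classifier on the weighted vectors
clf = MultinomialNB().fit(X, labels)

# New text must go through the SAME fitted vectorizer (transform, not fit_transform)
new_docs = ["idiot", "thank you"]
pred = clf.predict(vectorizer.transform(new_docs))
print("X.shape =", X.shape)
print("predictions =", pred)
```

Note that the vectorizer is fitted once and then reused with `transform` for unseen text, so the new vectors have the same columns as the training vectors.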



## Set up the notebook plot environment, import some basic modules, and load the data.

Notes:

- This block also does a minimal bit of cleanup: the embedded text "NEWLINE_TOKEN" is replaced with a newline, and "TAB_TOKEN" is removed.

In [31]:
# remove warnings
import warnings
warnings.filterwarnings('ignore')
# ---

# set matplotlib environment and import some basics
%matplotlib inline
from matplotlib import pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
import numpy as np
import pandas as pd
pd.options.display.max_columns = 100
pd.options.display.max_rows = 100
pd.options.display.max_colwidth = 100 # set to None to see entire text (older pandas versions used -1)

# ******************************************************
# load the wikipedia toxicity data provided by Matt
# ******************************************************
# set True to load the smaller data set, False to load the large data set
# NOTE: this code assumes data files are in the same folder as the notebook.
if False:
    # comment filename
    commentfile = 'toxicity_annotated_comments_unanimous.tsv'
    # rating filename
    ratingfile = 'toxicity_annotations_unanimous.tsv'

# full data set
else:
    # comment filename
    commentfile = 'toxicity_annotated_comments.tsv'
    # rating filename
    ratingfile = 'toxicity_annotations.tsv'

# load annotated comments
comments = pd.read_table(commentfile)
ratings = pd.read_table(ratingfile)

# remove weird tab/newline TOKEN text
comments['comment'] = comments['comment'].str.replace('NEWLINE_TOKEN','\n')
comments['comment'] = comments['comment'].str.replace('TAB_TOKEN','')

# show shape of each data set
print("comments.shape = ",comments.shape)
print("ratings.shape = ",ratings.shape)


comments.shape =  (159686, 7)
ratings.shape =  (1598289, 4)


### Import NLTK components

NLTK may already be installed in your Jupyter environment (it ships with Anaconda, for example), but you still need to download the corpus data that many of its tools use to process text. This only needs to be done once, so I've commented out the line that does it: the first time you run this code, uncomment it, run this cell, then comment it out again. NLTK will open a dialog allowing you to select what is downloaded - I chose all, but probably "most popular" will suffice.

In [32]:
import nltk

# uncomment this to download the nltk data. This needs to be done once, and takes a while.
#  I downloaded everything, but probably the "popular packages" will suffice.
#nltk.download()

## Combine comments and scores into one dataset

Use the Pandas groupby function to calculate the mean and median toxicity score for each comment, and add them as columns to the comment dataframe. Now I have comments and two measures of score aligned.

Next, I create a new toxicity categorical variable (0=not toxic, 1=toxic) by thresholding the median score at 0. I use the median score here because it is less sensitive to outlier scores than the mean.

Note that I don't use the mean or median scores beyond this point - the Naive Bayes classifier wants a categorical variable. However, you could potentially do some other interesting things with these scores, including implementing a different classifier that makes use of the score data - just sayin.
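
As an illustration of the aggregation and thresholding described above, here is a small sketch on made-up data. The two miniature tables and their values are hypothetical; only the column names `rev_id` and `toxicity_score` come from the real files. The comments are deliberately listed in a different order than the ratings, to show that mapping on `rev_id` keeps scores attached to the right comment.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables (values are made up)
comments = pd.DataFrame({"rev_id": [30, 10, 20],
                         "comment": ["c", "a", "b"]})
ratings = pd.DataFrame({"rev_id": [10, 10, 10, 20, 20, 20, 30, 30, 30],
                        "toxicity_score": [1, 0, 1, -1, -2, 0, 0, 0, -1]})

# One mean and one median per comment, indexed by rev_id
per_comment = ratings.groupby("rev_id")["toxicity_score"].agg(["mean", "median"])

# Attach the scores by rev_id rather than by row position
scored = comments.copy()
scored["mean_score"] = scored["rev_id"].map(per_comment["mean"])
scored["median_score"] = scored["rev_id"].map(per_comment["median"])

# Threshold the median at 0: negative median -> toxic (1), otherwise non-toxic (0)
scored["toxicity"] = (scored["median_score"] < 0).astype(int)
print(scored)
```

Comment 30 has one negative rating out of three, so its mean is negative but its median is 0 and it is labelled non-toxic, which is the outlier resistance mentioned above.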

In [33]:
scoredcomments = comments.copy()
# group all scores by comment ID (rev_id), then add mean and median score columns to the comment data.
# map() aligns the scores on rev_id instead of relying on both tables being in the same row order.
scores_by_id = ratings.groupby("rev_id")["toxicity_score"]
scoredcomments["mean_score"] = scoredcomments["rev_id"].map(scores_by_id.mean())
scoredcomments["median_score"] = scoredcomments["rev_id"].map(scores_by_id.median())

# create categorical variable toxicity: if median score < 0, toxicity=1, otherwise 0
scoredcomments["toxicity"] = (scoredcomments["median_score"] < 0).astype(int)

# make the comment ids ints
scoredcomments.rev_id = scoredcomments.rev_id.astype(np.int64)

print("scoredcomments.shape = ",scoredcomments.shape)
scoredcomments.head()

scoredcomments.shape =  (159686, 10)


| | rev_id | comment | year | logged_in | ns | sample | split | mean_score | median_score | toxicity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2232 | This:\n:One can make an analogy in mathematical terms by envisioning the distribution of opinion... | 2002 | True | article | random | train | 0.4 | 0.5 | 0 |
| 1 | 4216 | `\n\n:Clarification for you (and Zundark's right, i should have checked the Wikipedia bugs page... | 2002 | True | user | random | train | 0.5 | 0.0 | 0 |
| 2 | 8953 | Elected or Electoral? JHK | 2002 | False | article | random | test | 0.1 | 0.0 | 0 |
| 3 | 26547 | `This is such a fun entry. Devotchka\n\nI once had a coworker from Korea and not only couldn't... | 2002 | True | article | random | train | 0.6 | 0.0 | 0 |
| 4 | 28959 | Please relate the ozone hole to increases in cancer, and provide figures. Otherwise, this articl... | 2002 | True | article | random | test | 0.2 | 0.0 | 0 |


## Clean up the text

- Remove non alpha chars (numbers, etc)
- Drop words less than 3 chars
- Stem the words
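
The three steps above can be sketched as a single helper applied to one comment. This is a simplified illustration, not the cell that follows: the name `clean_and_stem` is made up here, and it splits on whitespace instead of using `word_tokenize`, so it needs no downloaded NLTK data.

```python
import re
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer(language="english")
non_alpha = re.compile(r"[^a-zA-Z\s]")  # anything that is not a letter or whitespace

def clean_and_stem(text, minwordsize=3):
    """Hypothetical helper: lower-case, strip non-alpha chars, drop short words, stem."""
    text = non_alpha.sub("", text.lower())
    # str.split() is an assumption made for this sketch; the notebook uses word_tokenize
    words = [stemmer.stem(w) for w in text.split() if len(w) >= minwordsize]
    return " ".join(words)

# third comment from the table above
print(clean_and_stem("Elected or Electoral? JHK"))
```

The question mark is stripped, the two-letter word "or" is dropped, and the remaining words are reduced to their stems.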



In [34]:
%%time
import re
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords as sw

#stemmer = PorterStemmer() # alternate stemmer
stemmer = SnowballStemmer(language='english')

# set up regex expression to remove all but alpha chars and whitespace
#  (raw string, so that \s is not treated as an invalid string escape)
regex = re.compile(r'[^a-zA-Z\s]')

numsamples = scoredcomments.comment.shape[0]

# set minimum word size. Words with fewer characters are dropped. 
#  I do this because there are a lot of 2 char initials in the comment data, which I think aren't useful,
#   ... or maybe they are - adjust this and see what happens!
minwordsize = 3

print("Processing %d samples:"%(numsamples))

# transform each sample text:
stemmed_text = []
for i, text in enumerate(scoredcomments.comment):
    # set to lower case and drop everything except letters and whitespace
    text = regex.sub('',text.lower())
    # look at each word in text
    t = []
    for word in word_tokenize(text):
        # drop "words" that are too short or too long (otherwise stem crashes!)
        if minwordsize <= len(word) < 30:
            t.append(stemmer.stem(word)) # stem the added word
    stemmed_text.append(" ".join(t)) # re-combine list of stemmed words
    if not i%5000: print(i,',', end="") # progress indicator

stemmed_text = pd.Series(np.array(stemmed_text)) # convert list of sample texts to pandas series
    
print("\n\nstemmed_text:\n",stemmed_text[:3])


Processing 159686 samples:
0 ,5000 ,10000 ,15000 ,20000 ,25000 ,30000 ,35000 ,40000 ,45000 ,50000 ,55000 ,60000 ,65000 ,70000 ,75000 ,80000 ,85000 ,90000 ,95000 ,100000 ,105000 ,110000 ,115000 ,120000 ,125000 ,130000 ,135000 ,140000 ,145000 ,150000 ,155000 ,

stemmed_text:
 0    this one can make analog mathemat term envis the distribut opinion popul gaussian curv would the...
1    clarif for you and zundark right should have check the wikipedia bug page first this bug the cod...
2                                                                                      elect elector jhk
dtype: object
Wall time: 2min 44s


## Train and test a MultinomialNB classifier with text data

I present two methods here that vary only in how the training and testing data are selected.

- train_test_MultinomialNB splits the text data once into a training set (75%) and a test set (25%), then trains and tests the classifier one time. This method is suitable for a first attempt to just get the thing working and then to do parameter tuning (which I don't do here).



- cross_validate_MultinomialNB uses k-fold cross-validation to partition the data into multiple train/test sets with non-overlapping test portions, and runs the classifier on each. If the classifier is solid, the results should be similar for all sets. If the classifier has problems (for example, it is overfitting), you will see variation between sets.

Normally, you would spend a lot of time doing the first sort of training, tweaking, etc and then periodically use cross-validation as a "reality check" to verify that your model is robust.
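
For comparison, scikit-learn can run the same kind of k-fold check in a few lines with `cross_val_score`. The sketch below is an alternative to, not a copy of, the functions defined in the next cell: it wraps the vectorizer and classifier in a pipeline, so the TF-IDF vocabulary and weights are re-fitted on the training portion of each fold only. The toy texts and labels are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = toxic, 0 = non-toxic
texts = ["you idiot", "stupid idiot", "idiot troll",
         "stupid troll", "you stupid troll", "dumb idiot",
         "thank you", "nice edit", "good source",
         "thank for edit", "nice source", "good edit thank"]
labels = [1] * 6 + [0] * 6

# Vectorizer + classifier as one estimator, fitted from scratch inside every fold
model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Stratified folds keep the toxic/non-toxic ratio the same in every split
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(model, texts, labels, cv=cv, scoring="accuracy")

print("accuracy per fold:", scores)
print("mean accuracy: %.2f" % scores.mean())
```

Large differences between the per-fold scores would be the "variation" warning sign described above.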


In [35]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

from sklearn.model_selection import KFold, StratifiedKFold

# helper function to report accuracy results of a prediction run
def print_prediction_results(y_est, y_target):
    
    print("Classifier results:")
    
    print("\ttest set: #non-toxic = %d = %2.0f%%,  #toxic = %d = %2.0f%%"%(
        y_est[y_target==0].size, 100*y_est[y_target==0].size/y_est.size,
        y_est[y_target==1].size, 100*y_est[y_target==1].size/y_est.size) )          

    print("\taccuracy all =    \t%d/%d = %2.1f%%"%(
        (y_est == y_target).sum(), 
        y_est.size,
        100*(y_est == y_target).sum() / y_est.size))

    print("\taccuracy non-toxic = \t%d/%d = %2.1f%%"%(
        (y_est[y_target==0] == 0).sum(),
        y_est[y_target==0].size,
        100*(y_est[y_target==0] == 0).sum() / y_est[y_target==0].size))

    print("\taccuracy toxic = \t%d/%d = %2.1f%%"%(
        (y_est[y_target==1] == 1).sum(), 
        y_est[y_target==1].size,
        100*(y_est[y_target==1] == 1).sum() / y_est[y_target==1].size))
    

# train and test MultinomialNB with text string data X_text, category labels in y
def train_test_MultinomialNB(X_text, y):
    
    # Tfidf vectorizer: vectorize the comment texts, and apply TF-IDF weighting
    # Note that there are a bunch of parameter options, but I just use defaults here.
    print("\nVectorizing text data...")
    X_vectors = TfidfVectorizer().fit_transform(X_text)
    print("X_vectors.shape = ",X_vectors.shape)

    # create some training and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        X_vectors, y, test_size=0.25, random_state=42)

    # create and fit a naive Bayes classifier to the training data
    clf = MultinomialNB().fit(X_train, y_train)

    # generate predictions for test data
    y_est = clf.predict(X_test)
    
    # print results of the prediction test
    print_prediction_results(y_est, y_test)
    

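# cross-validation of MultinomialNB with text string data X_text, category labels in y
def cross_validate_MultinomialNB(X_text, y):
    # use a plain array so y[train_index] selects by position, even if y is a pandas Series
    y = np.asarray(y)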

    # Tfidf vectorizer: vectorize the comment texts, and apply TF-IDF weighting
    # Note that there are a bunch of parameter options, but I just use defaults here.
    print("\nVectorizing text data...")
    X_vectors = TfidfVectorizer().fit_transform(X_text)
    print("X_vectors.shape = ",X_vectors.shape)

    # set up kfold to generate several train-test sets, 
    #  with shuffled indices for selecting from data
    kf = StratifiedKFold(n_splits=5, shuffle=True)

    i = 1
    accuracy = []
    for train_index, test_index in kf.split(X_vectors, y):
        print("\nk-fold train/test set #%d: "%(i))

        # create and fit a naive Bayes classifier to the training data
        clf = MultinomialNB().fit(X_vectors[train_index,:], y[train_index])

        # generate predictions for test data
        y_est = clf.predict(X_vectors[test_index,:])

        # print results of the prediction test
        print_prediction_results(y_est, y[test_index])

        accuracy.append((y_est == y[test_index]).sum() / y_est.size)
        i += 1

    print("\nOverall accuracy = %2.1f%%"%(np.mean(accuracy)*100))
 

## Train, test and cross validate the classifier

First, I look at the results of training the classifier using all of the comment data, despite the fact that non-toxic comments are strongly over-represented (see the counts of samples of each type in the output).

In [36]:
# if you want, you can test the impact of my cleaning and stemming on results by passing the 
#   raw comments instead
# text = scoredcomments.comment
text = stemmed_text

# train and test MultinomialNB classifier using all data
print("\n***************************")
print("Train and test classifier:")
train_test_MultinomialNB(text, scoredcomments.toxicity)
    
# cross-validate MultinomialNB classifier using nonoverlapping subsets of data
print("\n***************************")
print("Cross-validate classifier:")
cross_validate_MultinomialNB(text, scoredcomments.toxicity)


***************************
Train and test classifier:

Vectorizing text data...
X_vectors.shape =  (159686, 171219)
Classifier results:
	test set: #non-toxic = 35346 = 89%,  #toxic = 4576 = 11%
	accuracy all =    	36099/39922 = 90.4%
	accuracy non-toxic = 	35324/35346 = 99.9%
	accuracy toxic = 	775/4576 = 16.9%

***************************
Cross-validate classifier:

Vectorizing text data...
X_vectors.shape =  (159686, 171219)

k-fold train/test set #1: 
Classifier results:
	test set: #non-toxic = 28264 = 88%,  #toxic = 3674 = 12%
	accuracy all =    	28911/31938 = 90.5%
	accuracy non-toxic = 	28243/28264 = 99.9%
	accuracy toxic = 	668/3674 = 18.2%

k-fold train/test set #2: 
Classifier results:
	test set: #non-toxic = 28264 = 88%,  #toxic = 3673 = 12%
	accuracy all =    	28927/31937 = 90.6%
	accuracy non-toxic = 	28249/28264 = 99.9%
	accuracy toxic = 	678/3673 = 18.5%

k-fold train/test set #3: 
Classifier results:
	test set: #non-toxic = 28264 = 88%,  #toxic = 3673 = 12%
	accuracy a

## Equalize #samples of toxic vs non-toxic data


In the previous result, you will notice that the classifier is super good at classifying non-toxic comments (99.9% correct), but really sucks at classifying toxic comments (fewer than 1 in 5 are caught). This is a problem if you want the classifier to detect toxic comments! Why does this happen? One possibility is that, since most of the training data are non-toxic, the classifier is biased toward classifying most comments as non-toxic: MultinomialNB estimates its class priors from the training data by default, and leaning toward the majority class gives the highest overall accuracy.
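
To put numbers on this, the sketch below rebuilds label arrays from the counts printed in the single train/test run above and compares overall accuracy with per-class measures. Only the four counts come from the output; the use of `recall_score` and `balanced_accuracy_score` is a suggestion, not something the original analysis does.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Counts copied from the single train/test run above
n_nontoxic, n_toxic = 35346, 4576
correct_nontoxic, correct_toxic = 35324, 775

# Rebuild true and predicted labels that reproduce those counts (0 = non-toxic, 1 = toxic)
y_true = np.concatenate([np.zeros(n_nontoxic, dtype=int),
                         np.ones(n_toxic, dtype=int)])
y_pred = np.concatenate([np.zeros(correct_nontoxic, dtype=int),             # non-toxic, correct
                         np.ones(n_nontoxic - correct_nontoxic, dtype=int),  # non-toxic, wrong
                         np.ones(correct_toxic, dtype=int),                  # toxic, correct
                         np.zeros(n_toxic - correct_toxic, dtype=int)])      # toxic, missed

accuracy = accuracy_score(y_true, y_pred)
toxic_recall = recall_score(y_true, y_pred, pos_label=1)
balanced = balanced_accuracy_score(y_true, y_pred)    # mean of the two per-class accuracies
baseline = n_nontoxic / (n_nontoxic + n_toxic)        # always predicting "non-toxic"

print("accuracy          = %.1f%%" % (100 * accuracy))
print("toxic recall      = %.1f%%" % (100 * toxic_recall))
print("balanced accuracy = %.1f%%" % (100 * balanced))
print("baseline accuracy = %.1f%%" % (100 * baseline))
```

A classifier that labels every comment non-toxic would already reach about 88.5% accuracy on this test set, so the 90.4% above is less impressive than it looks; the balanced accuracy of about 58% reflects the poor toxic detection much more directly.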

So what if I equalize the number of non-toxic and toxic comments passed to the classifier for training?

In the results below, look at the relative gain in accuracy at detecting toxic comments vs the loss of accuracy at detecting non-toxic comments - not a bad tradeoff!
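
The cell below balances the classes by keeping the first `numtrainingsamples` non-toxic comments. Since the comments appear to be ordered by `rev_id`, that favors the oldest comments. A possible variation, shown here as a sketch on hypothetical stand-in data, is to draw the non-toxic subset at random instead; this is an assumption about what might work better, not what produced the results shown.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the subsample is reproducible

# Hypothetical stand-in data with roughly the same imbalance as the comments (88% / 12%)
y = np.array([0] * 88 + [1] * 12)
X_text = np.array(["comment %d" % i for i in range(y.size)])

toxic_idx = np.flatnonzero(y == 1)
nontoxic_idx = np.flatnonzero(y == 0)

# Randomly keep as many non-toxic samples as there are toxic ones, without repeats
keep_nontoxic = rng.choice(nontoxic_idx, size=toxic_idx.size, replace=False)

# Combine and shuffle so the two classes are interleaved
balanced_idx = rng.permutation(np.concatenate([keep_nontoxic, toxic_idx]))
X_balanced, y_balanced = X_text[balanced_idx], y[balanced_idx]

print("class counts after balancing:", np.bincount(y_balanced))
```

`X_balanced` and `y_balanced` could then be passed to the training and cross-validation functions in place of `X_text` and `target`.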


In [37]:
# number of samples to generate for each text category
numtrainingsamples = np.sum(scoredcomments.toxicity==1)

#text = np.array(scoredcomments.comment) # use this to work with un-modified comment data
text = np.array(stemmed_text) # use this to work with the cleaned and stemmed comment data

# split the data by category.
ind, = np.where(scoredcomments.toxicity==0)
X_nontoxic = text[ind]
target_nontoxic = scoredcomments.toxicity.values[ind]
ind, = np.where(scoredcomments.toxicity==1)
X_toxic = text[ind]
target_toxic = scoredcomments.toxicity.values[ind]

print("original data:")
print("#nontoxic = ",X_nontoxic.size)
print("#toxic = ",X_toxic.size)

# recombine the data with equalized number of samples of each category
X_text = np.concatenate( (X_nontoxic[:numtrainingsamples],X_toxic), axis=0) 
target = np.concatenate( (target_nontoxic[:numtrainingsamples],target_toxic), axis=0) 

# train and test MultinomialNB classifier using the balanced data
print("\n***************************")
print("Train and test classifier:")
train_test_MultinomialNB(X_text, target)
    
# cross-validate MultinomialNB classifier using nonoverlapping subsets of data
print("\n***************************")
print("Cross-validate classifier:")
cross_validate_MultinomialNB(X_text, target)
    

original data:
#nontoxic =  141320
#toxic =  18366

***************************
Train and test classifier:

Vectorizing text data...
X_vectors.shape =  (36732, 58715)
Classifier results:
	test set: #non-toxic = 4578 = 50%,  #toxic = 4605 = 50%
	accuracy all =    	8063/9183 = 87.8%
	accuracy non-toxic = 	4215/4578 = 92.1%
	accuracy toxic = 	3848/4605 = 83.6%

***************************
Cross-validate classifier:

Vectorizing text data...
X_vectors.shape =  (36732, 58715)

k-fold train/test set #1: 
Classifier results:
	test set: #non-toxic = 3674 = 50%,  #toxic = 3674 = 50%
	accuracy all =    	6450/7348 = 87.8%
	accuracy non-toxic = 	3364/3674 = 91.6%
	accuracy toxic = 	3086/3674 = 84.0%

k-fold train/test set #2: 
Classifier results:
	test set: #non-toxic = 3673 = 50%,  #toxic = 3673 = 50%
	accuracy all =    	6451/7346 = 87.8%
	accuracy non-toxic = 	3356/3673 = 91.4%
	accuracy toxic = 	3095/3673 = 84.3%

k-fold train/test set #3: 
Classifier results:
	test set: #non-toxic = 3673 = 50%