
#Loading data and setting up environment

In [55]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [56]:
cd /content/drive/MyDrive/FourthBrain/IndependentProject/

/content/drive/MyDrive/FourthBrain/IndependentProject


In [57]:
import pandas as pd
import numpy as np

#load in data sets
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
test_labels = pd.read_csv('test_labels.csv')

Let's get a preview of the data.

In [58]:
train.head(10)

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0
5,00025465d4725e87,"""\n\nCongratulations from me as well, use the ...",0,0,0,0,0,0
6,0002bcb3da6cb337,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1,1,1,0,1,0
7,00031b1e95af7921,Your vandalism to the Matt Shirvington article...,0,0,0,0,0,0
8,00037261f536c51d,Sorry if the word 'nonsense' was offensive to ...,0,0,0,0,0,0
9,00040093b2687caa,alignment on this subject and which are contra...,0,0,0,0,0,0


We combine the test data with its labels in order to classify it later on. We display the first 10 rows afterwards.

In [59]:
labeled_test = pd.merge(test, test_labels, on='id')
labeled_test.head(10)

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,00001cee341fdb12,Yo bitch Ja Rule is more succesful then you'll...,-1,-1,-1,-1,-1,-1
1,0000247867823ef7,== From RfC == \n\n The title is fine as it is...,-1,-1,-1,-1,-1,-1
2,00013b17ad220c46,""" \n\n == Sources == \n\n * Zawe Ashton on Lap...",-1,-1,-1,-1,-1,-1
3,00017563c3f7919a,":If you have a look back at the source, the in...",-1,-1,-1,-1,-1,-1
4,00017695ad8997eb,I don't anonymously edit articles at all.,-1,-1,-1,-1,-1,-1
5,0001ea8717f6de06,Thank you for understanding. I think very high...,0,0,0,0,0,0
6,00024115d4cbde0f,Please do not add nonsense to Wikipedia. Such ...,-1,-1,-1,-1,-1,-1
7,000247e83dcc1211,:Dear god this site is horrible.,0,0,0,0,0,0
8,00025358d4737918,""" \n Only a fool can believe in such numbers. ...",-1,-1,-1,-1,-1,-1
9,00026d1092fe71cc,== Double Redirects == \n\n When fixing double...,-1,-1,-1,-1,-1,-1


We notice that many rows have -1 as their labels. These rows were not used for scoring in the Kaggle competition, and since they are effectively unlabeled, we drop them.

In [60]:
#keep only the rows in which no label column is -1
label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
reduced_test = labeled_test[(labeled_test[label_cols] != -1).all(axis=1)].reset_index(drop=True)
reduced_test.head(10)

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0001ea8717f6de06,Thank you for understanding. I think very high...,0,0,0,0,0,0
1,000247e83dcc1211,:Dear god this site is horrible.,0,0,0,0,0,0
2,0002f87b16116a7f,"""::: Somebody will invariably try to add Relig...",0,0,0,0,0,0
3,0003e1cccfd5a40a,""" \n\n It says it right there that it IS a typ...",0,0,0,0,0,0
4,00059ace3e3e9a53,""" \n\n == Before adding a new product to the l...",0,0,0,0,0,0
5,000663aff0fffc80,this other one from 1897,0,0,0,0,0,0
6,000689dd34e20979,== Reason for banning throwing == \n\n This ar...,0,0,0,0,0,0
7,000844b52dee5f3f,|blocked]] from editing Wikipedia. |,0,0,0,0,0,0
8,00091c35fa9d0465,"== Arabs are committing genocide in Iraq, but ...",1,0,0,0,0,0
9,000968ce11f5ee34,Please stop. If you continue to vandalize Wiki...,0,0,0,0,0,0


We now determine what proportion of comments in both the training and test sets are positive examples for each category of toxicity.

In [61]:
print('Training set comments that are toxic by category:\n',
      '='*80,'\n', pd.DataFrame({'Proportion': np.mean(train.iloc[:,2:],axis = 0),
                                 'Number': np.sum(train.iloc[:,2:], axis=0)}),
      '\n\n\n', sep='')
print('Test set comments that are toxic by category:\n',
      '='*80,'\n', pd.DataFrame({'Proportion': np.mean(reduced_test.iloc[:,2:],axis = 0),
                                 'Number': np.sum(reduced_test.iloc[:,2:], axis=0)}),
      '\n', sep='')

Training set comments that are toxic by category:
               Proportion  Number
toxic            0.095844   15294
severe_toxic     0.009996    1595
obscene          0.052948    8449
threat           0.002996     478
insult           0.049364    7877
identity_hate    0.008805    1405



Test set comments that are toxic by category:
               Proportion  Number
toxic            0.095189    6090
severe_toxic     0.005736     367
obscene          0.057692    3691
threat           0.003298     211
insult           0.053565    3427
identity_hate    0.011129     712



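It appears that the proportions are broadly similar in each category for the training and test sets. However, every category is heavily imbalanced (under 10% positive, and under 1% for severe_toxic, threat, and identity_hate in the training set), so we will use class weighting when training later on.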
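As a rough illustration of what such weighting does, the sketch below applies the heuristic that scikit-learn documents for `class_weight='balanced'`, namely `n_samples / (n_classes * n_samples_in_class)`, to the toxic counts printed above. The total of 159571 training rows is taken from the training matrix shape reported later in this notebook; the variable names are illustrative only.

```python
# Sketch: class weights implied by the "balanced" heuristic,
# weight(c) = n_samples / (n_classes * n_samples_in_class_c).
n_samples = 159571            # training rows (from the X_train shape)
n_toxic = 15294               # positive "toxic" comments (from the table above)
n_clean = n_samples - n_toxic # negative "toxic" comments
n_classes = 2

weight_clean = n_samples / (n_classes * n_clean)
weight_toxic = n_samples / (n_classes * n_toxic)

print(f"weight for class 0: {weight_clean:.3f}")
print(f"weight for class 1: {weight_toxic:.3f}")
print(f"ratio: {weight_toxic / weight_clean:.2f}")
```

Under this heuristic, each toxic comment counts roughly nine times as much as a non-toxic one in the loss, and the ratio is larger still for the rarer categories.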

#Cleaning data


First, we define a function to clean up the comments and return the remaining words as a whitespace-separated string.

In [66]:
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
import nltk
nltk.download('wordnet')
nltk.download('stopwords')

#build the stop-word set and lemmatizer once instead of on every call
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
#characters stripped from the start and end of each word
strip_chars = '~`!@#$%^&*()_-+=\\|[{]};:\'",<.>/?*0123456789'

def clean_string(comment):
  '''
  Input
  ---------
  comment: string

  Output
  ---------
  string: words from comment are converted to lowercase, stripped of leading
  and trailing non-alphabetic characters, lemmatized, discarded if shorter
  than 3 characters or a stop word, and returned as a whitespace-separated
  string (each kept word is followed by a single space)
  '''
  #convert to lowercase and split into a list of words
  words = comment.lower().split()
  #strip words of leading or trailing characters
  words = [word.strip(strip_chars) for word in words]
  #lemmatize words
  words = [lemmatizer.lemmatize(word) for word in words]
  #drop words shorter than 3 characters and stop words, then rebuild the string
  return ''.join(word + ' ' for word in words
                 if len(word) > 2 and word not in stop_words)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
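
To see what the stripping and length-filtering steps of `clean_string` do and do not remove, here is a minimal sketch that leaves out lemmatization so it runs without NLTK. The stop-word set `TOY_STOP_WORDS`, the helper `strip_and_filter`, and the sample sentence are made up for illustration and are not the ones used in this notebook.

```python
# Same characters that clean_string strips from the ends of each word.
STRIP_CHARS = '~`!@#$%^&*()_-+=\\|[{]};:\'",<.>/?*0123456789'
# Hypothetical, tiny stop-word set (the notebook uses NLTK's English list).
TOY_STOP_WORDS = {"i", "the", "you", "are", "not"}

def strip_and_filter(comment):
    """Lowercase, split on whitespace, strip edge characters, drop short/stop words."""
    words = [w.strip(STRIP_CHARS) for w in comment.lower().split()]
    return [w for w in words if len(w) > 2 and w not in TOY_STOP_WORDS]

sample = "I'm NOT sure... (see rules:simple/complex) 2nd!!"
cleaned = strip_and_filter(sample)
print(cleaned)
```

Because `str.strip` only touches the start and end of a word, punctuation inside a word (as in "i'm" or "rules:simple/complex") survives, while "2nd!!" shrinks to "nd" and is then dropped by the length filter.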


Now, we edit our comments in the training and test sets via the above function.

In [67]:
#create copies to make it easier to edit later
mod_train = train.copy()
mod_test = reduced_test.copy()

#apply clean_string to the comments
mod_train['comment_text'] = mod_train['comment_text'].apply(clean_string)
mod_test['comment_text'] = mod_test['comment_text'].apply(clean_string)

Let's view some of the edited comments from each set.

In [68]:
print('Modified training set:\n', mod_train['comment_text'].head(10), '\n\n')
print('Modified test set:\n', mod_test['comment_text'].head(10), '\n\n')

Modified training set:
 0    explanation edits made username hardcore metal...
1    d'aww match background colour i'm seemingly st...
2    hey man i'm really trying edit war guy constan...
3    can't make real suggestion improvement wondere...
4                sir hero chance remember page that's 
5              congratulation well use tool well talk 
6                         cocksucker piss around work 
7    vandalism matt shirvington article reverted pl...
8    sorry word nonsense offensive anyway i'm inten...
9                alignment subject contrary dulithgow 
Name: comment_text, dtype: object 


Modified test set:
 0    thank understanding think highly would revert ...
1                              dear god site horrible 
2    somebody invariably try add religion really me...
3    say right type type institution needed case th...
4    adding new product list make sure relevant add...
5                                                 one 
6    reason banning throwing article ne

#Vectorizing and Modeling

We begin by importing TfidfVectorizer from sklearn to create a vector representation of our data. We fit it on the modified training data only and then use the fitted vectorizer to transform the test data as well, so that both sets share the same features when we make predictions.

In [69]:
from sklearn.feature_extraction.text import TfidfVectorizer

#instantiate vectorizer with limit on number of features
vectorizer = TfidfVectorizer(max_features = 100000)
#fit vectorizer to training data
X_train = vectorizer.fit_transform(mod_train['comment_text'].values)

Now, we transform the test data using our fitted vectorizer.

In [70]:
X_test = vectorizer.transform(mod_test['comment_text'].values)

We check to see what the shapes of the resulting matrices are.

In [71]:
print('Training data now has the shape: ', X_train.shape)
print('Test data now has the shape: ', X_test.shape)

Training data now has the shape:  (159571, 100000)
Test data now has the shape:  (63978, 100000)
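
Both matrices have the same number of columns because the vocabulary is learned from the training comments only. A minimal sketch of that fit/transform split on made-up toy documents (the `toy_` names are illustrative, and default `TfidfVectorizer` settings are assumed):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, made up for illustration.
toy_train = ["article page source", "page edit talk"]
toy_test = ["page vandalism"]

toy_vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and IDF weights from the training text
toy_X_train = toy_vectorizer.fit_transform(toy_train)
# transform reuses them; words unseen during fitting are ignored
toy_X_test = toy_vectorizer.transform(toy_test)

print(sorted(toy_vectorizer.vocabulary_))
print(toy_X_train.shape, toy_X_test.shape)
print("'vandalism' in vocabulary:", "vandalism" in toy_vectorizer.vocabulary_)
```

The word "vandalism" never appears in the toy training text, so it contributes nothing to the test row; only "page" produces a non-zero entry.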


##Logistic Regression
We start with the most basic classification model, namely logistic regression.

In [72]:
from sklearn.linear_model import LogisticRegression

#Instantiate logistic regression with a fixed random state and balanced class
#weights because of unbalanced data set
log_reg = LogisticRegression(random_state=0, class_weight='balanced', max_iter=1000)

Get train and test targets.

In [73]:
y_train = mod_train.iloc[:,2:].values
y_test = mod_test.iloc[:,2:].values

For each target, we run a cross-validated grid search to find the best parameters. Currently, the only parameter being tuned is C (the inverse of the regularization strength); the penalty is left at its default (l2). We may tune the penalty as well, including ElasticNet and its l1_ratio, if the saga solver runs fast enough.

In [74]:
log_reg.get_params().keys()

dict_keys(['C', 'class_weight', 'dual', 'fit_intercept', 'intercept_scaling', 'l1_ratio', 'max_iter', 'multi_class', 'n_jobs', 'penalty', 'random_state', 'solver', 'tol', 'verbose', 'warm_start'])

In [75]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score

#create a grid of parameters for C (we may optimize on penalty later if saga
#solver will run fast enough)
param_grid = {'C': [.03, .1, .3, 1, 3]}

print('Logistic Regression Predictions\n', '='*80, '\n\n', sep='')
y_pred = np.zeros(y_test.shape)
clf = GridSearchCV(log_reg, param_grid, scoring='roc_auc')
for i in range(y_train.shape[1]):
  print('Best parameters for ', train.columns[2+i], ' comments\n',
        '-'*80, sep='')
  clf.fit(X_train, y_train[:,i])
  print(clf.best_params_, '\n\n')
  y_pred[:,i] = clf.predict(X_test)
print('Overall score:', roc_auc_score(y_test, y_pred))

Logistic Regression Predictions


Best parameters for toxic comments
--------------------------------------------------------------------------------
{'C': 1} 


Best parameters for severe_toxic comments
--------------------------------------------------------------------------------
{'C': 0.1} 


Best parameters for obscene comments
--------------------------------------------------------------------------------
{'C': 1} 


Best parameters for threat comments
--------------------------------------------------------------------------------
{'C': 0.1} 


Best parameters for insult comments
--------------------------------------------------------------------------------
{'C': 1} 


Best parameters for identity_hate comments
--------------------------------------------------------------------------------
{'C': 0.3} 


Overall score: 0.9161478469936725


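In the Kaggle competition, scoring was based on the mean column-wise area under the receiver operating characteristic curve, so we will display this metric for each of our models. Note that the score above is computed from the hard 0/1 labels returned by predict rather than from predicted probabilities, so it is likely lower than the value the competition metric would report for the same model.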
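
ROC AUC is a ranking metric, so it is normally computed from continuous scores rather than thresholded labels. The sketch below uses made-up toy scores to show how thresholding at 0.5 before calling `roc_auc_score` can lower the result; the arrays are illustrative and are not taken from this notebook's data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy ground truth and predicted probabilities (made up for illustration).
y_true = np.array([0, 0, 0, 1, 1])
y_score = np.array([0.1, 0.6, 0.7, 0.8, 0.9])
# Hard labels, as predict() would return with a 0.5 threshold.
y_hard = (y_score >= 0.5).astype(int)

auc_from_scores = roc_auc_score(y_true, y_score)
auc_from_labels = roc_auc_score(y_true, y_hard)

print("AUC from probabilities:", auc_from_scores)
print("AUC from hard labels:  ", auc_from_labels)
```

Here every positive is ranked above every negative, so the scores give a perfect AUC, while the hard labels create ties that pull it down. In this notebook the scores would presumably come from something like `clf.predict_proba(X_test)[:, 1]`, with the hard labels kept separately for the classification report.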

Now, let's see if we can determine what types of errors are occurring in our data by running a classification report and then pulling some of the misclassified data and inspecting it by hand.

In [76]:
from sklearn.metrics import classification_report

for i in range(y_train.shape[1]):
  print('Classification statistics for ', train.columns[2+i], ' comments\n',
        '-'*80, sep='')
  print(classification_report(y_test[:,i], y_pred[:,i]), '\n\n')
print('Total classification statistics\n', '='*80, '\n', '='*80, sep='')
print(classification_report(y_test.ravel(), y_pred.ravel()))

Classification statistics for toxic comments
--------------------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.99      0.87      0.93     57888
           1       0.43      0.92      0.58      6090

    accuracy                           0.88     63978
   macro avg       0.71      0.89      0.76     63978
weighted avg       0.94      0.88      0.89     63978
 


Classification statistics for severe_toxic comments
--------------------------------------------------------------------------------
              precision    recall  f1-score   support

           0       1.00      0.95      0.98     63611
           1       0.10      0.93      0.19       367

    accuracy                           0.95     63978
   macro avg       0.55      0.94      0.58     63978
weighted avg       0.99      0.95      0.97     63978
 


Classification statistics for obscene comments
----------------------------------

It appears that the model has quite low precision on the positive class, particularly for threat and severe_toxic comments (0.10 for severe_toxic). In our training data, these comments and identity_hate comments have particularly few positive examples, so we may need to do some data engineering or use another type of algorithm to improve precision on them.
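
The low precision follows almost directly from the class imbalance. As a back-of-the-envelope check, the sketch below plugs the rounded recall values and supports from the severe_toxic report above into the definition of precision; because the inputs are rounded, the counts are approximate.

```python
# Rounded figures from the severe_toxic classification report above.
n_pos = 367        # support of class 1
n_neg = 63611      # support of class 0
recall_pos = 0.93  # recall on class 1
recall_neg = 0.95  # recall on class 0

# Approximate counts implied by those recalls.
true_pos = recall_pos * n_pos
false_pos = (1 - recall_neg) * n_neg

# precision = TP / (TP + FP)
precision_pos = true_pos / (true_pos + false_pos)

print(f"approx. true positives:  {true_pos:.0f}")
print(f"approx. false positives: {false_pos:.0f}")
print(f"approx. precision:       {precision_pos:.2f}")
```

Even a 5% error rate on the negatives produces several thousand false positives, which swamps the few hundred true positives and yields a precision near the reported 0.10.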

Let's first view some of the misclassified data for threat and severe_toxic comments.

In [77]:
#first, we create a DataFrame of our test set with the predicted labels added on
test_with_pred_labels = mod_test.copy()
#we add the column names, along with an initial p_ for predicted
new_cols = 'p_' + test_with_pred_labels.columns[2:]
#add columns
test_with_pred_labels[new_cols] = y_pred

#now, we retrieve the misclassified threats and print out some of the data
misclassified_indices_threat = (y_test[:,3] != y_pred[:,3])
#print out examples (for best viewing)
for i in range(20):
  print(test_with_pred_labels.comment_text[misclassified_indices_threat].iloc[i], '\n\n')

burn hell revoke talk page access  


black mamba it.is ponious snake word kill many people king cobra kill many people india  


trust trust checkuser work eat burn guarantee hate seriously  


i'll  


hell  


drunk made gibson say really belief little freak nazi like father hope lung cancer soon  


still shouldnt kill woman reason like christains soppose good people  


said going longgggg one  


think delete whole fuckin wikipedia  


cite policy wasting fucking time cite policy i'll strike bastard better good fucking idea complaint say wasting goddamn time opinion worthless matter policy guideline source give pile steaming dogshit manner sounding nonsense bullshit understanding community whole fuck lot ruder opinion jesus christ stop polite-trolling learn fucking rule wikipedia's rules:simple/complex  


removal band several band list may thrash metal band either black death posting band comment whether thrash genre bracket disputed desaster black deströyer blackened thrash div

Let's additionally view how these comments are being labeled in the other categories and what the most common words are in these comments.

In [78]:
test_with_pred_labels[misclassified_indices_threat].iloc[:20,:]

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate,p_toxic,p_severe_toxic,p_obscene,p_threat,p_insult,p_identity_hate
27,0016b94c8b20ffa6,burn hell revoke talk page access,0,0,0,0,0,0,1.0,0.0,0.0,1.0,0.0,0.0
61,00372dfdd531fc6c,black mamba it.is ponious snake word kill many...,0,0,0,0,0,0,1.0,0.0,0.0,1.0,0.0,0.0
148,009ba60f01432c26,trust trust checkuser work eat burn guarantee ...,0,0,0,0,0,0,1.0,0.0,0.0,1.0,1.0,0.0
164,00ad0323d9b31f21,i'll,0,0,0,0,0,0,0.0,0.0,0.0,1.0,0.0,0.0
172,00b3813b966af7e8,hell,1,0,0,0,0,0,1.0,1.0,1.0,1.0,1.0,0.0
184,00bd66c9ef023f41,drunk made gibson say really belief little fre...,1,0,0,0,0,0,1.0,1.0,1.0,1.0,1.0,1.0
280,011b51b386c60373,still shouldnt kill woman reason like christai...,0,0,0,0,0,0,1.0,0.0,0.0,1.0,0.0,1.0
291,0123f7c07bedbf0d,said going longgggg one,0,0,0,0,0,0,0.0,0.0,0.0,1.0,0.0,0.0
307,0137ae56816b36fa,think delete whole fuckin wikipedia,1,0,1,0,1,0,1.0,1.0,1.0,1.0,1.0,1.0
375,0173dd710621e443,cite policy wasting fucking time cite policy i...,1,0,1,0,1,0,1.0,1.0,1.0,1.0,1.0,0.0


It appears that most errors are not due to misrecognizing words but to failing to distinguish between the different categories. Perhaps a sentiment analysis would improve this. The lemmatization and other minor changes to preprocessing above have added about 0.01 to our ROC AUC score, which is a good improvement. Currently, there is not a clear way to improve the preprocessing given that the biggest issue is overidentifying threats, severe_toxic comments, etc.
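
Every row in the table above has threat = 0 and p_threat = 1.0, i.e. a false positive. To quantify overidentification, the misclassified mask can be split into false positives and false negatives. A minimal sketch on made-up arrays (`toy_true` and `toy_pred` stand in for one column of `y_test` and `y_pred`):

```python
import numpy as np

# Toy labels and predictions for a single category (made up for illustration).
toy_true = np.array([0, 0, 1, 0, 1, 0])
toy_pred = np.array([1, 0, 1, 1, 0, 1])

# False positive: predicted 1 but actually 0. False negative: the reverse.
false_pos = (toy_true == 0) & (toy_pred == 1)
false_neg = (toy_true == 1) & (toy_pred == 0)

print("false positives:", int(false_pos.sum()))
print("false negatives:", int(false_neg.sum()))
print("total misclassified:", int((toy_true != toy_pred).sum()))
```

Either boolean mask can then be used to index the comments, in the same way `misclassified_indices_threat` is used above.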

Let's give a summary of some of the words that are being misclassified. We will continue to focus on the threats for now.

In [79]:
from nltk.probability import FreqDist

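#first, create list of all words
#NOTE: this loops over every comment in the test set; use
#test_with_pred_labels['comment_text'][misclassified_indices_threat] to
#restrict the count to the misclassified threat comments
all_words = []
for comment in test_with_pred_labels['comment_text']:
  all_words += comment.split()
#find a frequency distribution of the words
fdist = FreqDist(all_words)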
#print 30 most common words
print('Word                Count\n', '-'*80, sep='')
for word, count in fdist.most_common(30):
  print(word, '.'*(20-len(word)), count, sep='')


Word                Count
--------------------------------------------------------------------------------
article.............26976
page................17888
would...............11226
one.................11096
wikipedia...........10924
like................10791
please..............9657
source..............8333
think...............8290
see.................7972
also................7410
know................7064
i'm.................7039
people..............6717
time................6665
make................5959
fuck................5930
use.................5821
talk................5772
say.................5742
edit................5574
may.................5534
need................5413
get.................5223
name................4924
section.............4871
thanks..............4795
even................4675
doe.................4651
good................4584


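Interestingly, many of the words revolve around related concepts like "article", "page", "wikipedia", and "source". Note that these counts are taken over every comment in the cleaned test set rather than only the misclassified ones, so they mostly reflect the general vocabulary of the Wikipedia comments.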
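
A count restricted to misclassified comments can look quite different from a count over all comments. A minimal sketch with `collections.Counter` on made-up data (the comments and the boolean mask are hypothetical stand-ins for `comment_text` and `misclassified_indices_threat`):

```python
from collections import Counter

# Toy cleaned comments and a toy "misclassified" mask (made up for illustration).
toy_comments = [
    "burn hell revoke talk page",
    "article page source",
    "hell",
    "talk page edit",
]
toy_misclassified = [True, False, True, False]

# Word counts over every comment.
all_counts = Counter(w for c in toy_comments for w in c.split())
# Word counts over misclassified comments only.
mis_counts = Counter(
    w
    for c, is_mis in zip(toy_comments, toy_misclassified)
    if is_mis
    for w in c.split()
)

print("all comments:      ", all_counts.most_common(3))
print("misclassified only:", mis_counts.most_common(3))
```

In the toy data the overall top word is "page", while the top word among misclassified comments is "hell", which is the kind of difference the filtered count is meant to surface.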

#To do



1. Consider doing some data engineering to add more non-toxic examples containing Wikipedia-related terms and the other common words listed above.
2. Create a deep learning model, perhaps using BERT.
