
#Loading data and setting up environment

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
cd /content/drive/MyDrive/FourthBrain/IndependentProject/

/content/drive/MyDrive/FourthBrain/IndependentProject


In [None]:
import pandas as pd
import numpy as np

#load in data sets
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
test_labels = pd.read_csv('test_labels.csv')

Let's get a preview of the data.

In [None]:
train.head(10)

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0
5,00025465d4725e87,"""\n\nCongratulations from me as well, use the ...",0,0,0,0,0,0
6,0002bcb3da6cb337,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1,1,1,0,1,0
7,00031b1e95af7921,Your vandalism to the Matt Shirvington article...,0,0,0,0,0,0
8,00037261f536c51d,Sorry if the word 'nonsense' was offensive to ...,0,0,0,0,0,0
9,00040093b2687caa,alignment on this subject and which are contra...,0,0,0,0,0,0


We combine the test data with its labels in order to classify it later on. We display the first 10 rows afterwards.

In [None]:
labeled_test = pd.merge(test, test_labels, on='id')
labeled_test.head(10)

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,00001cee341fdb12,Yo bitch Ja Rule is more succesful then you'll...,-1,-1,-1,-1,-1,-1
1,0000247867823ef7,== From RfC == \n\n The title is fine as it is...,-1,-1,-1,-1,-1,-1
2,00013b17ad220c46,""" \n\n == Sources == \n\n * Zawe Ashton on Lap...",-1,-1,-1,-1,-1,-1
3,00017563c3f7919a,":If you have a look back at the source, the in...",-1,-1,-1,-1,-1,-1
4,00017695ad8997eb,I don't anonymously edit articles at all.,-1,-1,-1,-1,-1,-1
5,0001ea8717f6de06,Thank you for understanding. I think very high...,0,0,0,0,0,0
6,00024115d4cbde0f,Please do not add nonsense to Wikipedia. Such ...,-1,-1,-1,-1,-1,-1
7,000247e83dcc1211,:Dear god this site is horrible.,0,0,0,0,0,0
8,00025358d4737918,""" \n Only a fool can believe in such numbers. ...",-1,-1,-1,-1,-1,-1
9,00026d1092fe71cc,== Double Redirects == \n\n When fixing double...,-1,-1,-1,-1,-1,-1


We notice that many rows have -1 for every label. These rows were not used in the scoring of the Kaggle competition, so we discard them since they are effectively unlabeled.

In [None]:
label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
#keep only rows where every label is different from -1
reduced_test = labeled_test[(labeled_test[label_cols] != -1).all(axis=1)].reset_index(drop=True)
reduced_test.head(10)

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0001ea8717f6de06,Thank you for understanding. I think very high...,0,0,0,0,0,0
1,000247e83dcc1211,:Dear god this site is horrible.,0,0,0,0,0,0
2,0002f87b16116a7f,"""::: Somebody will invariably try to add Relig...",0,0,0,0,0,0
3,0003e1cccfd5a40a,""" \n\n It says it right there that it IS a typ...",0,0,0,0,0,0
4,00059ace3e3e9a53,""" \n\n == Before adding a new product to the l...",0,0,0,0,0,0
5,000663aff0fffc80,this other one from 1897,0,0,0,0,0,0
6,000689dd34e20979,== Reason for banning throwing == \n\n This ar...,0,0,0,0,0,0
7,000844b52dee5f3f,|blocked]] from editing Wikipedia. |,0,0,0,0,0,0
8,00091c35fa9d0465,"== Arabs are committing genocide in Iraq, but ...",1,0,0,0,0,0
9,000968ce11f5ee34,Please stop. If you continue to vandalize Wiki...,0,0,0,0,0,0


We now determine what proportion of comments in both the training and test sets are positive examples for each category of toxicity.

In [None]:
print('Training set comments that are toxic by category:\n',
      '='*80,'\n', pd.DataFrame({'Proportion': np.mean(train.iloc[:,2:],axis = 0),
                                 'Number': np.sum(train.iloc[:,2:], axis=0)}),
      '\n\n\n', sep='')
print('Test set comments that are toxic by category:\n',
      '='*80,'\n', pd.DataFrame({'Proportion': np.mean(reduced_test.iloc[:,2:],axis = 0),
                                 'Number': np.sum(reduced_test.iloc[:,2:], axis=0)}),
      '\n', sep='')

Training set comments that are toxic by category:
               Proportion  Number
toxic            0.095844   15294
severe_toxic     0.009996    1595
obscene          0.052948    8449
threat           0.002996     478
insult           0.049364    7877
identity_hate    0.008805    1405



Test set comments that are toxic by category:
               Proportion  Number
toxic            0.095189    6090
severe_toxic     0.005736     367
obscene          0.057692    3691
threat           0.003298     211
insult           0.053565    3427
identity_hate    0.011129     712



It appears that the proportions are similar in each category for the training and test sets. However, the data is not balanced, so we will use weighting when training later on.
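To make the weighting idea concrete, here is a minimal sketch of the heuristic that scikit-learn documents for `class_weight='balanced'`, namely `n_samples / (n_classes * count_per_class)`. The helper name `balanced_class_weights` is made up for illustration; the counts come from the training-set table above and the row count reported for the training data.

```python
# Sketch of the "balanced" class-weight heuristic:
# weight(class) = n_samples / (n_classes * number of samples in that class)

def balanced_class_weights(counts):
    """Return {class_label: weight} for a dict of {class_label: count}."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}

# 15294 of the 159571 training comments are labeled toxic.
n_total = 159571
n_toxic = 15294
weights = balanced_class_weights({0: n_total - n_toxic, 1: n_toxic})
print(weights)
```

With these counts, a toxic comment ends up weighted roughly 9.4 times as heavily as a non-toxic one, which is why recall on the positive class rises at the cost of precision.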

#Cleaning data


First, we define a function to clean up the comments and return the remaining words as a single space-separated string.

In [None]:
def clean_string(comment):
  '''
  Input
  ---------
  comment: string

  Output
  ---------
  string: words from comment are converted to lowercase and stripped of most
  leading and trailing non-alphabetic characters with empty words deleted,
  then returned as a single string with space separators
  '''
  #characters to strip from the start and end of each word (backslash escaped)
  strip_chars = '~`!@#$%^&*()_-+=\\|[{]};:\'",<.>/?*0123456789'
  #convert to lowercase and split into a list of words
  words = comment.lower().split()
  #strip each word of unneeded leading or trailing characters
  stripped = [word.strip(strip_chars) for word in words]
  #drop empty words and join the rest with single spaces
  return ' '.join(word for word in stripped if word)
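One property of this approach is worth seeing on a small example: stripping only touches the start and end of each word, so symbols or digits inside a word survive. The sketch below is a compact stand-in for the cleaning function, not the function itself; the name `strip_edges` and the use of `string.punctuation` plus `string.digits` as the character set are assumptions made for illustration.

```python
import string

# Characters removed from the ends of each word (assumed set for this sketch).
STRIP_CHARS = string.punctuation + string.digits

def strip_edges(comment):
    """Lowercase, split on whitespace, strip edge characters, drop empties."""
    words = (word.strip(STRIP_CHARS) for word in comment.lower().split())
    return ' '.join(word for word in words if word)

# Edge punctuation and standalone numbers disappear...
print(strip_edges('Hey!!! Don\'t "edit" in 2021...'))
# ...but obfuscated spellings with interior symbols are left untouched.
print(strip_edges('y0u are an id!ot'))
```

The second call shows why obfuscated words can slip past a vocabulary built from clean spellings.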

Now, we create a copy of our training data in which we replace comment_text with the simplified string of words.

In [None]:
mod_train = train.copy()
mod_train['comment_text'] = mod_train['comment_text'].apply(clean_string)
mod_train.head(10)

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,explanation why the edits made under my userna...,0,0,0,0,0,0
1,000103f0d9cfb60f,d'aww he matches this background colour i'm se...,0,0,0,0,0,0
2,000113f07ec002fd,hey man i'm really not trying to edit war it's...,0,0,0,0,0,0
3,0001b41b1c6bb37e,more i can't make any real suggestions on impr...,0,0,0,0,0,0
4,0001d958c54c6e35,you sir are my hero any chance you remember wh...,0,0,0,0,0,0
5,00025465d4725e87,congratulations from me as well use the tools ...,0,0,0,0,0,0
6,0002bcb3da6cb337,cocksucker before you piss around on my work,1,1,1,0,1,0
7,00031b1e95af7921,your vandalism to the matt shirvington article...,0,0,0,0,0,0
8,00037261f536c51d,sorry if the word nonsense was offensive to yo...,0,0,0,0,0,0
9,00040093b2687caa,alignment on this subject and which are contra...,0,0,0,0,0,0


We now apply the same cleaning to the test data and check it.

In [None]:
mod_test = reduced_test.copy()
mod_test['comment_text'] = mod_test['comment_text'].apply(clean_string)
mod_test.head(10)

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0001ea8717f6de06,thank you for understanding i think very highl...,0,0,0,0,0,0
1,000247e83dcc1211,dear god this site is horrible,0,0,0,0,0,0
2,0002f87b16116a7f,somebody will invariably try to add religion r...,0,0,0,0,0,0
3,0003e1cccfd5a40a,it says it right there that it is a type the t...,0,0,0,0,0,0
4,00059ace3e3e9a53,before adding a new product to the list make s...,0,0,0,0,0,0
5,000663aff0fffc80,this other one from,0,0,0,0,0,0
6,000689dd34e20979,reason for banning throwing this article needs...,0,0,0,0,0,0
7,000844b52dee5f3f,blocked from editing wikipedia,0,0,0,0,0,0
8,00091c35fa9d0465,arabs are committing genocide in iraq but no p...,1,0,0,0,0,0
9,000968ce11f5ee34,please stop if you continue to vandalize wikip...,0,0,0,0,0,0


#Vectorizing and Modeling

We begin by importing the library needed to create a vector representation of our data. In this version, we start with TfidfVectorizer from sklearn. We fit it on the modified training data and then use the fitted vectorizer to transform our test data as well in order to make predictions.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

#instantiate vectorizer with stop_words to ignore commonly occurring words with 
#little value to the comment
vectorizer = TfidfVectorizer(max_df=0.5, stop_words='english')
#fit vectorizer to training data
X_train = vectorizer.fit_transform(mod_train['comment_text'].values)

Now, we transform the test data using our fitted vectorizer.

In [None]:
X_test = vectorizer.transform(mod_test['comment_text'].values)

We check to see what the shapes of the resulting matrices are.

In [None]:
print('Training data now has the shape: ', X_train.shape)
print('Test data now has the shape: ', X_test.shape)

Training data now has the shape:  (159571, 181325)
Test data now has the shape:  (63978, 181325)
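To give a feel for what the numbers in these matrices mean, here is a small hand-rolled TF-IDF sketch. It assumes raw term counts, the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row, which is our understanding of the TfidfVectorizer defaults. It splits on whitespace only and ignores stop words and `max_df`, so it illustrates the weighting rather than reproducing the vectorizer exactly. The function name `tfidf` and the toy documents are made up.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # smoothed inverse document frequency (assumed sklearn default)
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        raw = {term: count * idf[term] for term, count in Counter(tokens).items()}
        # L2-normalise so every non-empty document has unit length
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({term: w / norm for term, w in raw.items()} if norm else {})
    return vectors

docs = ['please stop vandalism', 'please stop', 'thank you']
vectors = tfidf(docs)
for doc, vec in zip(docs, vectors):
    print(doc, {term: round(weight, 3) for term, weight in vec.items()})
```

In the first toy document, the rarer word "vandalism" receives a larger weight than "please" or "stop", which appear in two of the three documents.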


##Logistic Regression
We start with the most basic classification model, namely logistic regression.

In [None]:
from sklearn.linear_model import LogisticRegression

#instantiate logistic regression with a fixed random state and balanced class
#weights because of unbalanced data set
log_reg = LogisticRegression(random_state=0, class_weight='balanced', max_iter=1000)

Get train and test targets.

In [None]:
y_train = mod_train.iloc[:,2:].values
y_test = mod_test.iloc[:,2:].values

For each target, we run a basic logistic regression and output the classification report.

In [None]:
from sklearn.metrics import classification_report

print('Logistic Regression Predictions\n', '='*80, '\n\n', sep='')
y_pred = np.zeros(y_test.shape)
for i in range(y_train.shape[1]):
  print('Classification statistics for ', train.columns[2+i], ' comments\n',
        '-'*80, sep='')
  model = log_reg.fit(X_train, y_train[:,i])
  y_pred[:,i] = model.predict(X_test)
  print(classification_report(y_test[:,i], y_pred[:,i]), '\n\n')
print('Total classification statistics\n', '='*80, '\n', '='*80, sep='')
print(classification_report(y_test.ravel(), y_pred.ravel()))

Logistic Regression Predictions


Classification statistics for toxic comments
--------------------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.99      0.87      0.93     57888
           1       0.43      0.91      0.59      6090

    accuracy                           0.88     63978
   macro avg       0.71      0.89      0.76     63978
weighted avg       0.94      0.88      0.90     63978
 


Classification statistics for severe_toxic comments
--------------------------------------------------------------------------------
              precision    recall  f1-score   support

           0       1.00      0.97      0.98     63611
           1       0.14      0.91      0.24       367

    accuracy                           0.97     63978
   macro avg       0.57      0.94      0.61     63978
weighted avg       0.99      0.97      0.98     63978
 


Classification statistics for obscene comments


In the Kaggle competition, scoring was based on the mean column-wise area under the receiver operating characteristic curve, so we will display this metric for each of our models.

In [None]:
from sklearn.metrics import roc_auc_score
print('ROC AUC score for logistic regression:', roc_auc_score(y_test, y_pred))

ROC AUC score for logistic regression: 0.9083097374636147


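The score is about 0.9083. To put this into perspective, there were 4539 contestants, and the score for 1st place (on the public leaderboard) was 0.98901, while 0.98692 was sufficient for a bronze. Note that the score here is computed from hard 0/1 predictions rather than from predicted probabilities, which is what the competition scored, so it is not directly comparable with the leaderboard. The current models I am creating will serve as a baseline to compare to future iterations of this project.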
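Because ROC AUC measures how well a model ranks positives above negatives, it matters whether we pass continuous scores or thresholded labels. The sketch below uses a small pure-Python AUC (the helper `roc_auc` and the toy scores are made up for illustration) to show that thresholding can discard ranking information. In sklearn, continuous scores would typically come from `decision_function` or, for logistic regression, `predict_proba`.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the probability that a random positive outranks a random
    negative, counting ties as one half."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

y_true = [0, 0, 0, 1, 1]
scores = [0.1, 0.4, 0.6, 0.7, 0.9]             # hypothetical continuous scores
hard = [1 if s >= 0.5 else 0 for s in scores]  # thresholded 0/1 predictions

auc_scores = roc_auc(y_true, scores)
auc_hard = roc_auc(y_true, hard)
print('AUC from scores:', auc_scores)
print('AUC from hard labels:', auc_hard)
```

In this toy case the scores rank every positive above every negative, but the thresholded labels tie one negative with the positives, so the AUC computed from hard labels is lower.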

##SVM model

We now try to make a basic SVM model using LinearSVC from sklearn.

In [None]:
from sklearn.svm import LinearSVC

#we again balance our weights
svm_estimator = LinearSVC(random_state=0, class_weight='balanced', max_iter=1250)

We again fit this estimator to each type of toxicity and print classification statistics.

In [None]:
print('SVM Model Predictions\n', '='*80, '\n\n', sep='')
y_pred = np.zeros(y_test.shape)
for i in range(y_train.shape[1]):
  print('Classification statistics for ', train.columns[2+i], ' comments\n',
        '-'*80, sep='')
  model = svm_estimator.fit(X_train, y_train[:,i])
  y_pred[:,i] = model.predict(X_test)
  print(classification_report(y_test[:,i], y_pred[:,i]), '\n\n')
print('Total classification statistics\n', '='*80, '\n', '='*80, sep='')
print(classification_report(y_test.ravel(), y_pred.ravel()))
print('\n\nROC AUC score for SVM:', roc_auc_score(y_test, y_pred))

SVM Model Predictions


Classification statistics for toxic comments
--------------------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.99      0.89      0.94     57888
           1       0.46      0.88      0.60      6090

    accuracy                           0.89     63978
   macro avg       0.72      0.88      0.77     63978
weighted avg       0.94      0.89      0.90     63978
 


Classification statistics for severe_toxic comments
--------------------------------------------------------------------------------




              precision    recall  f1-score   support

           0       1.00      0.98      0.99     63611
           1       0.17      0.75      0.27       367

    accuracy                           0.98     63978
   macro avg       0.58      0.86      0.63     63978
weighted avg       0.99      0.98      0.98     63978
 


Classification statistics for obscene comments
--------------------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.99      0.95      0.97     60287
           1       0.52      0.83      0.63      3691

    accuracy                           0.95     63978
   macro avg       0.75      0.89      0.80     63978
weighted avg       0.96      0.95      0.95     63978
 


Classification statistics for threat comments
--------------------------------------------------------------------------------
              precision    recall  f1-score   support

           0       1.00      1

The score here is 0.8522568171983279.

##Dimensionality Reduction

In order to apply other methods in a reasonable amount of time, we will need to apply dimensionality reduction. To do so, we will use TruncatedSVD from sklearn.

In [None]:
from sklearn.decomposition import TruncatedSVD

#first, we create our transformers, starting with 10 components
svd = TruncatedSVD(n_components=10)
#we fit the model on X_train and transform both X_train and X_test
X_train_red = svd.fit_transform(X_train)
X_test_red = svd.transform(X_test)

We check to make sure that the dimensions have been reduced. We have started with 10 components due to trial and error in training the following two methods.

In [None]:
print('Dimensions of training data:', X_train_red.shape)
print('Dimensions of test data:', X_test_red.shape)

Dimensions of training data: (159571, 10)
Dimensions of test data: (63978, 10)


Let's get a measure of how much information is retained by this reduced data.

In [None]:
print('Explained Variance:', svd.explained_variance_ratio_.sum())

Explained Variance: 0.03195950919679924


##Gradient Boosted Trees

We now try gradient boosted trees and measure their performance in a similar manner.

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

#instantiate estimator with fixed random state
grad_boost_est = GradientBoostingClassifier(random_state = 0)

print('Gradient Boosted Trees Predictions\n', '='*80, '\n\n', sep='')
y_pred = np.zeros(y_test.shape)
for i in range(y_train.shape[1]):
  print('Classification statistics for ', train.columns[2+i], ' comments\n',
        '-'*80, sep='')
  model = grad_boost_est.fit(X_train_red, y_train[:,i])
  y_pred[:,i] = model.predict(X_test_red)
  print(classification_report(y_test[:,i], y_pred[:,i]), '\n\n')
print('Total classification statistics\n', '='*80, '\n', '='*80, sep='')
print(classification_report(y_test.ravel(), y_pred.ravel()))
print('\n\nROC AUC score for GBT:', roc_auc_score(y_test, y_pred))

Gradient Boosted Trees Predictions


Classification statistics for toxic comments
--------------------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.94      0.98      0.96     57888
           1       0.70      0.41      0.52      6090

    accuracy                           0.93     63978
   macro avg       0.82      0.70      0.74     63978
weighted avg       0.92      0.93      0.92     63978
 


Classification statistics for severe_toxic comments
--------------------------------------------------------------------------------
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     63611
           1       0.21      0.17      0.19       367

    accuracy                           0.99     63978
   macro avg       0.60      0.59      0.59     63978
weighted avg       0.99      0.99      0.99     63978
 


Classification statistics for obscene commen

The score here is 0.612971532491109, which is significantly lower than for logistic regression and SVM. The dimensions were reduced (with only about 3% of the variance retained by the 10 components), but given that it took 9 minutes (compared to about a minute for logistic regression on the whole dataset), it may be worth spending time optimizing a previous method.

##Random Forest Classifier

We now train the last of our simple models, namely a Random Forest Classifier. We again use sklearn to create our estimator.

In [None]:
from sklearn.ensemble import RandomForestClassifier

rand_for_est = RandomForestClassifier(random_state=0, max_depth=50, n_estimators=200)

We now fit the model as we did above.

In [None]:
print('Random Forest Predictions\n', '='*80, '\n\n', sep='')
y_pred = np.zeros(y_test.shape)
for i in range(y_train.shape[1]):
  print('Classification statistics for ', train.columns[2+i], ' comments\n',
        '-'*80, sep='')
  model = rand_for_est.fit(X_train, y_train[:,i])
  y_pred[:,i] = model.predict(X_test)
  print(classification_report(y_test[:,i], y_pred[:,i]), '\n\n')
print('Total classification statistics\n', '='*80, '\n', '='*80, sep='')
print(classification_report(y_test.ravel(), y_pred.ravel()))
print('\n\nROC AUC score for random forest:', roc_auc_score(y_test, y_pred))

Random Forest Predictions


Classification statistics for toxic comments
--------------------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.91      1.00      0.95     57888
           1       0.97      0.01      0.03      6090

    accuracy                           0.91     63978
   macro avg       0.94      0.51      0.49     63978
weighted avg       0.91      0.91      0.86     63978
 


Classification statistics for severe_toxic comments
--------------------------------------------------------------------------------


              precision    recall  f1-score   support

           0       0.99      1.00      1.00     63611
           1       0.00      0.00      0.00       367

    accuracy                           0.99     63978
   macro avg       0.50      0.50      0.50     63978
weighted avg       0.99      0.99      0.99     63978
 


Classification statistics for obscene comments
--------------------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.94      1.00      0.97     60287
           1       1.00      0.01      0.01      3691

    accuracy                           0.94     63978
   macro avg       0.97      0.50      0.49     63978
weighted avg       0.95      0.94      0.92     63978
 


Classification statistics for threat comments
--------------------------------------------------------------------------------


              precision    recall  f1-score   support

           0       1.00      1.00      1.00     63767
           1       0.00      0.00      0.00       211

    accuracy                           1.00     63978
   macro avg       0.50      0.50      0.50     63978
weighted avg       0.99      1.00      1.00     63978
 


Classification statistics for insult comments
--------------------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.95      1.00      0.97     60551
           1       1.00      0.00      0.01      3427

    accuracy                           0.95     63978
   macro avg       0.97      0.50      0.49     63978
weighted avg       0.95      0.95      0.92     63978
 


Classification statistics for identity_hate comments
--------------------------------------------------------------------------------


              precision    recall  f1-score   support

           0       0.99      1.00      0.99     63266
           1       0.00      0.00      0.00       712

    accuracy                           0.99     63978
   macro avg       0.49      0.50      0.50     63978
weighted avg       0.98      0.99      0.98     63978
 


Total classification statistics
              precision    recall  f1-score   support

           0       0.96      1.00      0.98    369370
           1       0.98      0.01      0.02     14498

    accuracy                           0.96    383868
   macro avg       0.97      0.50      0.50    383868
weighted avg       0.96      0.96      0.94    383868



ROC AUC score for SVM: 0.5019752870662554


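The random forest score is 0.5019752870662554, essentially chance level, after about 9 minutes of fitting (the printed label says "SVM", but this is the random forest result). The model labels almost every comment as non-toxic, with an overall recall of 0.01 for the positive class; unlike the previous models, it was trained without class weighting.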

##Using reduced data for logistic regression and SVM

We now see how logistic regression and SVM perform on reduced data compared with the whole data set. We will now use 100 dimensions for the reduced data, which is recommended for LSA in the TruncatedSVD documentation.

In [None]:
#we create our transformer again, this time with 100 components
svd = TruncatedSVD(n_components=100)
#we fit the model on X_train and transform both X_train and X_test
X_train_red = svd.fit_transform(X_train)
X_test_red = svd.transform(X_test)
print('Sizes of reduced data sets:', X_train_red.shape, ',', X_test_red.shape)

Sizes of reduced data sets: (159571, 100) , (63978, 100)


Let's again measure how much variance is explained by this reduction in the data.

In [None]:
print('Explained variance:', svd.explained_variance_.sum())

Explained variance: 0.12212808179083937


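This says that about 12% of the variance of the data is explained. Strictly speaking, explained_variance_ reports the absolute variance captured by the components, while explained_variance_ratio_ (used earlier) gives the proportion directly, so this figure is approximate. Either way, much of the variance is lost, so we likely will increase the number of components in future iterations.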

In [None]:
print('Logistic Regression Predictions Using Reduced Data\n', '='*80, '\n\n', sep='')
y_pred = np.zeros(y_test.shape)
for i in range(y_train.shape[1]):
  print('Classification statistics for ', train.columns[2+i], ' comments\n',
        '-'*80, sep='')
  model = log_reg.fit(X_train_red, y_train[:,i])
  y_pred[:,i] = model.predict(X_test_red)
  print(classification_report(y_test[:,i], y_pred[:,i]), '\n\n')
print('Total classification statistics\n', '='*80, '\n', '='*80, sep='')
print(classification_report(y_test.ravel(), y_pred.ravel()))
print('\n\nROC AUC score for logistic regression:', roc_auc_score(y_test, y_pred))

Logistic Regression Predictions Using Reduced Data


Classification statistics for toxic comments
--------------------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.98      0.85      0.91     57888
           1       0.38      0.85      0.52      6090

    accuracy                           0.85     63978
   macro avg       0.68      0.85      0.72     63978
weighted avg       0.92      0.85      0.88     63978
 


Classification statistics for severe_toxic comments
--------------------------------------------------------------------------------
              precision    recall  f1-score   support

           0       1.00      0.93      0.96     63611
           1       0.07      0.94      0.13       367

    accuracy                           0.93     63978
   macro avg       0.53      0.93      0.55     63978
weighted avg       0.99      0.93      0.96     63978
 


Classification statistics fo

The score for logistic regression on reduced data is 0.8678501055627228, which is about 0.04 lower than on the entire data set.

In [None]:
print('SVM Model Predictions\n', '='*80, '\n\n', sep='')
y_pred = np.zeros(y_test.shape)
for i in range(y_train.shape[1]):
  print('Classification statistics for ', train.columns[2+i], ' comments\n',
        '-'*80, sep='')
  model = svm_estimator.fit(X_train_red, y_train[:,i])
  y_pred[:,i] = model.predict(X_test_red)
  print(classification_report(y_test[:,i], y_pred[:,i]), '\n\n')
print('Total classification statistics\n', '='*80, '\n', '='*80, sep='')
print(classification_report(y_test.ravel(), y_pred.ravel()))
print('\n\nROC AUC score for SVM:', roc_auc_score(y_test, y_pred))

SVM Model Predictions


Classification statistics for toxic comments
--------------------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.98      0.87      0.92     57888
           1       0.40      0.84      0.54      6090

    accuracy                           0.86     63978
   macro avg       0.69      0.85      0.73     63978
weighted avg       0.93      0.86      0.88     63978
 


Classification statistics for severe_toxic comments
--------------------------------------------------------------------------------
              precision    recall  f1-score   support

           0       1.00      0.93      0.96     63611
           1       0.07      0.92      0.13       367

    accuracy                           0.93     63978
   macro avg       0.53      0.93      0.55     63978
weighted avg       0.99      0.93      0.96     63978
 


Classification statistics for obscene comments
----------



              precision    recall  f1-score   support

           0       1.00      0.78      0.88     63767
           1       0.01      0.90      0.03       211

    accuracy                           0.78     63978
   macro avg       0.51      0.84      0.45     63978
weighted avg       1.00      0.78      0.88     63978
 


Classification statistics for insult comments
--------------------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.99      0.89      0.94     60551
           1       0.30      0.81      0.43      3427

    accuracy                           0.89     63978
   macro avg       0.64      0.85      0.69     63978
weighted avg       0.95      0.89      0.91     63978
 


Classification statistics for identity_hate comments
--------------------------------------------------------------------------------




              precision    recall  f1-score   support

           0       1.00      0.79      0.88     63266
           1       0.05      0.88      0.09       712

    accuracy                           0.80     63978
   macro avg       0.52      0.84      0.49     63978
weighted avg       0.99      0.80      0.88     63978
 


Total classification statistics
              precision    recall  f1-score   support

           0       0.99      0.86      0.92    369370
           1       0.19      0.83      0.31     14498

    accuracy                           0.86    383868
   macro avg       0.59      0.85      0.62    383868
weighted avg       0.96      0.86      0.90    383868



ROC AUC score for SVM: 0.8634345482710714


The score here is 0.8634345482710714, which is a bit better than the 0.8522568171983279 that SVM achieved on the original data.

We try transforming our data one more time to see how the performance is with 1000 components.

In [None]:
from sklearn.decomposition import TruncatedSVD

#we create our transformer again, this time with 1000 components
svd = TruncatedSVD(n_components=1000)
#we fit the model on X_train and transform both X_train and X_test
X_train_red = svd.fit_transform(X_train)
X_test_red = svd.transform(X_test)

In [None]:
print('Explained variance:', svd.explained_variance_.sum())

Explained variance: 0.37908675784413576


Now, roughly 37.9% of the variance is explained (explained_variance_ reports absolute variance, so this is an approximate proportion). Let's see how logistic regression and SVM do.

In [None]:
print('Logistic Regression Predictions Using 1000 Components\n', '='*80, '\n\n', sep='')
y_pred = np.zeros(y_test.shape)
for i in range(y_train.shape[1]):
  print('Classification statistics for ', train.columns[2+i], ' comments\n',
        '-'*80, sep='')
  model = log_reg.fit(X_train_red, y_train[:,i])
  y_pred[:,i] = model.predict(X_test_red)
  print(classification_report(y_test[:,i], y_pred[:,i]), '\n\n')
print('Total classification statistics\n', '='*80, '\n', '='*80, sep='')
print(classification_report(y_test.ravel(), y_pred.ravel()))
print('\n\nROC AUC score for logistic regression (1000 components):', roc_auc_score(y_test, y_pred))

Logistic Regression Predictions Using 1000 Components


Classification statistics for toxic comments
--------------------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.99      0.86      0.92     57888
           1       0.39      0.89      0.54      6090

    accuracy                           0.86     63978
   macro avg       0.69      0.87      0.73     63978
weighted avg       0.93      0.86      0.88     63978
 


Classification statistics for severe_toxic comments
--------------------------------------------------------------------------------
              precision    recall  f1-score   support

           0       1.00      0.93      0.96     63611
           1       0.07      0.95      0.14       367

    accuracy                           0.93     63978
   macro avg       0.54      0.94      0.55     63978
weighted avg       0.99      0.93      0.96     63978
 


Classification statistics

This is much closer to our original performance for logistic regression.

In [None]:
print('SVM Model Predictions (1000 components)\n', '='*80, '\n\n', sep='')
y_pred = np.zeros(y_test.shape)
for i in range(y_train.shape[1]):
  print('Classification statistics for ', train.columns[2+i], ' comments\n',
        '-'*80, sep='')
  model = svm_estimator.fit(X_train_red, y_train[:,i])
  y_pred[:,i] = model.predict(X_test_red)
  print(classification_report(y_test[:,i], y_pred[:,i]), '\n\n')
print('Total classification statistics\n', '='*80, '\n', '='*80, sep='')
print(classification_report(y_test.ravel(), y_pred.ravel()))
print('\n\nROC AUC score for SVM (1000 components):', roc_auc_score(y_test, y_pred))

SVM Model Predictions (1000 components)


Classification statistics for toxic comments
--------------------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.99      0.85      0.92     57888
           1       0.39      0.89      0.54      6090

    accuracy                           0.86     63978
   macro avg       0.69      0.87      0.73     63978
weighted avg       0.93      0.86      0.88     63978
 


Classification statistics for severe_toxic comments
--------------------------------------------------------------------------------




              precision    recall  f1-score   support

           0       1.00      0.93      0.96     63611
           1       0.07      0.92      0.12       367

    accuracy                           0.93     63978
   macro avg       0.53      0.92      0.54     63978
weighted avg       0.99      0.93      0.96     63978
 


Classification statistics for obscene comments
--------------------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.99      0.91      0.95     60287
           1       0.38      0.87      0.53      3691

    accuracy                           0.91     63978
   macro avg       0.69      0.89      0.74     63978
weighted avg       0.96      0.91      0.93     63978
 


Classification statistics for threat comments
--------------------------------------------------------------------------------




              precision    recall  f1-score   support

           0       1.00      0.95      0.97     63767
           1       0.05      0.88      0.10       211

    accuracy                           0.95     63978
   macro avg       0.53      0.91      0.54     63978
weighted avg       1.00      0.95      0.97     63978
 


Classification statistics for insult comments
--------------------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.99      0.89      0.94     60551
           1       0.30      0.87      0.45      3427

    accuracy                           0.89     63978
   macro avg       0.65      0.88      0.69     63978
weighted avg       0.95      0.89      0.91     63978
 


Classification statistics for identity_hate comments
--------------------------------------------------------------------------------




              precision    recall  f1-score   support

           0       1.00      0.90      0.94     63266
           1       0.09      0.87      0.16       712

    accuracy                           0.90     63978
   macro avg       0.54      0.88      0.55     63978
weighted avg       0.99      0.90      0.94     63978
 


Total classification statistics
              precision    recall  f1-score   support

           0       0.99      0.91      0.95    369370
           1       0.27      0.88      0.41     14498

    accuracy                           0.90    383868
   macro avg       0.63      0.89      0.68    383868
weighted avg       0.97      0.90      0.93    383868



ROC AUC score for SVM (1000 components): 0.8929973297971223


In this case, we get a score of 0.8929973297971223, which is an improvement over previous models using SVM.

#To do



1. Improve preprocessing: View examples that are being misclassified and see if there are particular types of misspellings that are not being recognized (such as abbreviations, use of @ and other symbols, and so on). This may also involve adding a similarity metric to my words and/or adding stemming.
2. Create visualizations near the beginning to explore the data more in-depth.
3. Add a deep learning model. 
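As a starting point for the first item, the following sketch separates misclassified comments into false positives and false negatives so they can be read side by side. The helper name `misclassified` and the toy data are hypothetical; in the notebook, the inputs would be the test comments together with one column of the true labels and one column of the predictions.

```python
def misclassified(comments, y_true, y_pred):
    """Split wrongly predicted comments into false positives and false negatives."""
    false_pos, false_neg = [], []
    for text, truth, pred in zip(comments, y_true, y_pred):
        if pred == 1 and truth == 0:
            false_pos.append(text)
        elif pred == 0 and truth == 1:
            false_neg.append(text)
    return false_pos, false_neg

# Hypothetical toy data, not taken from the actual test set.
comments = ['thank you for the edit', 'you are an id!ot', 'this page is horrible']
y_true = [0, 1, 0]
y_pred = [0, 0, 1]

false_pos, false_neg = misclassified(comments, y_true, y_pred)
print('False positives:', false_pos)
print('False negatives:', false_neg)
```

Reading the false negatives is where obfuscated spellings would be expected to show up, while the false positives hint at harmless words the model has learned to treat as toxic.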

