In [None]:
%cd '/content/drive/My Drive/Toxic Content Detection'

/content/drive/My Drive/Toxic Content Detection


## Loading Dataset

Dataset taken from: [Kaggle - Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data)

Dataset has 8 columns:
*   Index: **id** 
*   Input: **comment_text**
*   Target Classes: **toxic**, **severe_toxic**, **obscene**, **threat**, **insult**, **identity_hate**

We combined all the target classes into a single binary target named **toxic**: a comment is labelled 1 if any of the six original labels is 1, and 0 otherwise.
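As a toy illustration of that merge (the three rows below are made up, not taken from the dataset), a row-wise `any` over the six label columns gives the same 0/1 result as chaining `|`, provided every label is 0 or 1:

```python
import pandas as pd

label_cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Three hypothetical comments: clean, obscene + insult, toxic + severe_toxic
toy = pd.DataFrame(
    [
        [0, 0, 0, 0, 0, 0],
        [0, 0, 1, 0, 1, 0],
        [1, 1, 0, 0, 0, 0],
    ],
    columns=label_cols,
)

# A row is toxic if at least one of the six labels is set
combined = toy[label_cols].any(axis=1).astype(int)
print(combined.tolist())  # [0, 1, 1]
```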

In [None]:
import pandas as pd
import numpy as np

In [None]:
# Loading train data

train = pd.read_csv('Datasets/jigsaw-toxic-comment/train.csv', index_col="id")
train['toxic'] = (train['toxic'] | train['severe_toxic'] | train['obscene'] | train['threat'] | train['insult'] | train['identity_hate']).astype('category')
train = train.drop(['severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate'], axis=1)

train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 159571 entries, 0000997932d777bf to fff46fc426af1f9a
Data columns (total 2 columns):
 #   Column        Non-Null Count   Dtype   
---  ------        --------------   -----   
 0   comment_text  159571 non-null  object  
 1   toxic         159571 non-null  category
dtypes: category(1), object(1)
memory usage: 2.6+ MB


In [None]:
train.head(10)

id,comment_text,toxic
0000997932d777bf,Explanation\nWhy the edits made under my usern...,0
000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0
000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0
0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0
0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0
00025465d4725e87,"""\n\nCongratulations from me as well, use the ...",0
0002bcb3da6cb337,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1
00031b1e95af7921,Your vandalism to the Matt Shirvington article...,0
00037261f536c51d,Sorry if the word 'nonsense' was offensive to ...,0
00040093b2687caa,alignment on this subject and which are contra...,0


In [None]:
# Loading test data

test = pd.read_csv('Datasets/jigsaw-toxic-comment/test.csv', index_col="id")
test_labels = pd.read_csv('Datasets/jigsaw-toxic-comment/test_labels.csv', index_col="id")
test = test.join(test_labels)
test['toxic'] = (test['toxic'] | test['severe_toxic'] | test['obscene'] | test['threat'] | test['insult'] | test['identity_hate']).astype('category')
test = test.drop(['severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate'], axis=1)
test = test.drop(test[test['toxic'] == -1].index)
test['toxic'] = test['toxic'].cat.remove_unused_categories()

test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 63978 entries, 0001ea8717f6de06 to fffb5451268fb5ba
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   comment_text  63978 non-null  object  
 1   toxic         63978 non-null  category
dtypes: category(1), object(1)
memory usage: 1.0+ MB


In [None]:
test.tail(5)

id,comment_text,toxic
fff8f64043129fa2,":Jerome, I see you never got around to this…! ...",0
fff9d70fe0722906,==Lucky bastard== \n http://wikimediafoundatio...,0
fffa8a11c4378854,==shame on you all!!!== \n\n You want to speak...,0
fffac2a094c8e0e2,MEL GIBSON IS A NAZI BITCH WHO MAKES SHITTY MO...,1
fffb5451268fb5ba,""" \n\n == Unicorn lair discovery == \n\n Suppo...",0


## Preprocessing

Since the comments are raw, they contain **formatting markups** and **special characters**.

### We do the following preprocessing:
*   Removing all ASCII punctuation.
*   Splitting the text on runs of non-word characters (this drops line breaks, repeated spaces, and any remaining symbols).
*   Rejoining the tokens with single spaces.
*   Stripping leading and trailing whitespace.
*   Lowercasing the comment.

In [None]:
import re
import string

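def preprocess(text):
    # Drop ASCII punctuation, e.g. "I'm" becomes "Im"
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Split on runs of non-word characters (raw string avoids an invalid-escape warning)
    tokens = re.split(r'\W+', text)
    # Rejoin with single spaces, trim both ends, and lowercase
    text = " ".join(tokens).strip().lower()
    return text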
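A quick check of what `preprocess` does to two comment fragments. The function is repeated so the snippet runs on its own; the first input is shortened from a training sample shown earlier, and the URL in the second is a made-up placeholder. Note that a URL collapses into a single token once its punctuation is removed.

```python
import re
import string

def preprocess(text):
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = re.split(r'\W+', text)
    return " ".join(tokens).strip().lower()

# Apostrophes and exclamation marks are removed before splitting
print(preprocess("D'aww! He matches this background colour"))
# daww he matches this background colour

# Wiki markup and the line break disappear; the (hypothetical) URL is glued together
print(preprocess("==Lucky bastard== \n http://example.org/wiki"))
# lucky bastard httpexampleorgwiki
```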

In [None]:
# Applying preprocessing on train and test data

train['comment_text'] = train['comment_text'].apply(preprocess)
test['comment_text'] = test['comment_text'].apply(preprocess)

In [None]:
train.head(10)

id,comment_text,toxic
0000997932d777bf,explanation why the edits made under my userna...,0
000103f0d9cfb60f,daww he matches this background colour im seem...,0
000113f07ec002fd,hey man im really not trying to edit war its j...,0
0001b41b1c6bb37e,more i cant make any real suggestions on impro...,0
0001d958c54c6e35,you sir are my hero any chance you remember wh...,0
00025465d4725e87,congratulations from me as well use the tools ...,0
0002bcb3da6cb337,cocksucker before you piss around on my work,1
00031b1e95af7921,your vandalism to the matt shirvington article...,0
00037261f536c51d,sorry if the word nonsense was offensive to yo...,0
00040093b2687caa,alignment on this subject and which are contra...,0


In [None]:
test.tail(5)

id,comment_text,toxic
fff8f64043129fa2,jerome i see you never got around to this i m ...,0
fff9d70fe0722906,lucky bastard httpwikimediafoundationorgwikipr...,0
fffa8a11c4378854,shame on you all you want to speak about gays ...,0
fffac2a094c8e0e2,mel gibson is a nazi bitch who makes shitty mo...,1
fffb5451268fb5ba,unicorn lair discovery supposedly a unicorn la...,0


## Sampling dataset

The target class distribution is imbalanced (roughly 9 non-toxic : 1 toxic), so we compared three sampling strategies:

*   **Original**: Trains on the imbalanced classes as they are.
*   **Undersampled**: Randomly selects a subset of samples from the larger classes so that every class matches the size of the smallest class.
*   **Oversampled**: Randomly duplicates samples from the smaller classes so that every class matches the size of the largest class.
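The two resampling strategies can be sketched with NumPy alone. This is only an illustration of the idea on ten made-up comments; the helper name `random_resample` and its `mode` and `seed` parameters are hypothetical, and the notebook itself relies on imbalanced-learn's samplers rather than this function.

```python
import numpy as np

def random_resample(x, y, mode="under", seed=0):
    """Balance classes by random under- or oversampling (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.min() if mode == "under" else counts.max()
    picked = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        if mode == "under":
            # Keep a random subset, drawn without replacement
            chosen = rng.choice(idx, size=target, replace=False)
        else:
            # Keep every original row and add random duplicates
            extra = rng.choice(idx, size=target - count, replace=True)
            chosen = np.concatenate([idx, extra])
        picked.append(chosen)
    picked = np.concatenate(picked)
    return x[picked], y[picked]

# Toy data with the same 9:1 ratio as the training set
x = np.array([f"comment {i}" for i in range(10)])
y = np.array([0] * 9 + [1])

x_under, y_under = random_resample(x, y, mode="under")
x_over, y_over = random_resample(x, y, mode="over")
print(np.unique(y_under, return_counts=True)[1].tolist())  # [1, 1]
print(np.unique(y_over, return_counts=True)[1].tolist())   # [9, 9]
```

Undersampling discards most of the majority class, while oversampling keeps all the data at the cost of repeated minority rows.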


In [None]:
# Looking at the distribution of target class
train['toxic'].value_counts()

0    143346
1     16225
Name: toxic, dtype: int64

In [None]:
sampling="Original"

if sampling=="Original":
    x_train_raw = train['comment_text'].to_numpy().reshape((-1))
    y_train = train['toxic'].to_numpy().reshape((-1))
elif sampling=="Undersampled":
    from imblearn.under_sampling import RandomUnderSampler
    undersampler = RandomUnderSampler()
    x_train_raw, y_train = undersampler.fit_resample(
                                train['comment_text'].to_numpy().reshape((-1, 1)), 
                                train['toxic'].to_numpy().reshape((-1, 1))
                            )
elif sampling=="Oversampled":
    from imblearn.over_sampling import RandomOverSampler
    oversampler = RandomOverSampler()
    x_train_raw, y_train = oversampler.fit_resample(
                               train['comment_text'].to_numpy().reshape((-1, 1)), 
                               train['toxic'].to_numpy().reshape((-1, 1))
                            )

x_test_raw = test['comment_text'].to_numpy().reshape((-1))
y_test = test['toxic'].to_numpy().reshape((-1))

## Embedding Layer

We used various pretrained embedding layers from [tfhub](https://tfhub.dev/) and compared their performances.

### We used the following embedding layers:
*   [Wiki Words 500](https://tfhub.dev/google/Wiki-words-500/2): based on skipgram version of word2vec with 1 out-of-vocabulary bucket.
*   [NNLM](https://tfhub.dev/google/nnlm-en-dim128/2): based on feed-forward Neural-Net Language Models with 3 hidden layers.
*   [Universal Sentence Encoder](https://tfhub.dev/google/universal-sentence-encoder/4): based on deep averaging network (DAN) encoder.
*   [Universal Sentence Encoder Lite](https://tfhub.dev/google/universal-sentence-encoder-lite/2): based on Transformer architecture.

In [None]:
!pip3 install --quiet sentencepiece
import tensorflow_hub as hub

# For NNLM, WW500, USE
def hub_embed(link):    
    def load_embed():
        import tensorflow as tf

        embed = hub.load(link)
        def get_embedding(msgs):
            return embed(msgs).numpy()
        return get_embedding
    return load_embed

# For USE lite
def use_lite_embed():
    import sentencepiece as spm
    from absl import logging
    import tensorflow.compat.v1 as tf
    tf.disable_v2_behavior()

    module = hub.Module("https://tfhub.dev/google/universal-sentence-encoder-lite/2")
    input_placeholder = tf.sparse_placeholder(tf.int64, shape=[None, None])
    encodings = module(
        inputs=dict(
            values=input_placeholder.values,
            indices=input_placeholder.indices,
            dense_shape=input_placeholder.dense_shape))
    
    with tf.Session() as sess:
        spm_path = sess.run(module(signature="spm_path"))
    sp = spm.SentencePieceProcessor()
    sp.Load(spm_path)

    def process_to_IDs_in_sparse_format(sp, sentences):
        ids = [sp.EncodeAsIds(x) for x in sentences]
        max_len = max(len(x) for x in ids)
        dense_shape=(len(ids), max_len)
        values=[item for sublist in ids for item in sublist]
        indices=[[row,col] for row in range(len(ids)) for col in range(len(ids[row]))]
        return (values, indices, dense_shape)

    def embed(msgs):
        values, indices, dense_shape = process_to_IDs_in_sparse_format(sp, msgs)
        logging.set_verbosity(logging.ERROR)

        with tf.Session() as session:
            session.run([tf.global_variables_initializer(), tf.tables_initializer()])
            return session.run(
                encodings,
                feed_dict={input_placeholder.values: values,
                            input_placeholder.indices: indices,
                            input_placeholder.dense_shape: dense_shape})
    return embed

In [None]:
# Embeddings
embeddings = [
              {'code':"WW500", 'name': "wiki-word-500", 'embed_func': hub_embed("https://tfhub.dev/google/Wiki-words-500-with-normalization/2")},
              {'code':"NNLM128", 'name': "nnlm-en-dim128", 'embed_func': hub_embed("https://tfhub.dev/google/nnlm-en-dim128-with-normalization/2")},
              {'code':"USE", 'name': "universal-sentence-encoder", 'embed_func': hub_embed("https://tfhub.dev/google/universal-sentence-encoder/4")},
              {'code':"USEL", 'name': "universal-sentence-encoder-lite", 'embed_func': use_lite_embed},
]

## Classifiers

We compared 16 scikit-learn classifiers, each with its default hyperparameters, on predictive performance, training time, and prediction time.

In [None]:
# Classifiers
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, BaggingClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier, SGDClassifier, PassiveAggressiveClassifier, Perceptron
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier

classifiers = [
               {"name": "LR", "classifier": LogisticRegression},
               {"name": "SGD", "classifier": SGDClassifier},
               {"name": "Ridge", "classifier": RidgeClassifier},
               {"name": "LDA", "classifier": LinearDiscriminantAnalysis},
               {"name": "PA", "classifier": PassiveAggressiveClassifier},
               {"name": "GNB", "classifier": GaussianNB},
               {"name": "Perceptron", "classifier": Perceptron},
               {"name": "ET", "classifier": ExtraTreeClassifier},
               {"name": "DT", "classifier": DecisionTreeClassifier},
               {"name": "RF", "classifier": RandomForestClassifier},
               {"name": "AdaB", "classifier": AdaBoostClassifier},
               {"name": "MLP", "classifier": MLPClassifier},
               {"name": "Bagging", "classifier": BaggingClassifier},
               {"name": "GradB", "classifier": GradientBoostingClassifier},
               {"name": "SVC", "classifier": SVC},
               {"name": "KNN", "classifier": KNeighborsClassifier},
]

## Metrics

Every model is evaluated on the 6 metrics below; the confusion matrix is logged as its four counts (tn, fp, fn, tp).

In [None]:
# Metrics
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix, precision_score, recall_score, f1_score

def confusion_matrix_proper(y_true, pred):
    return ",".join(str(score) for score in confusion_matrix(y_true, pred).ravel())

metrics = [
           {"name": "accuracy", "metric": accuracy_score},
           {"name": "balanced_accuracy", "metric": balanced_accuracy_score},
           {"name": "tn,fp,fn,tp", "metric": confusion_matrix_proper},
           {"name": "precision", "metric": precision_score},
           {"name": "recall", "metric": recall_score},
           {"name": "f1", "metric": f1_score},
]
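To make the relationship between these metrics concrete, the sketch below recomputes them from the four confusion-matrix counts alone. The helper `metrics_from_counts` is a hypothetical name, not part of scikit-learn; the counts are the ones logged in this notebook for LR on wiki-word-500.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Derive the scalar metrics from binary confusion-matrix counts."""
    recall = tp / (tp + fn)          # share of toxic comments that were caught
    specificity = tn / (tn + fp)     # share of non-toxic comments left alone
    precision = tp / (tp + fp)       # share of flagged comments that were toxic
    return {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "balanced_accuracy": (recall + specificity) / 2,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

scores = metrics_from_counts(tn=54957, fp=2778, fn=2319, tp=3924)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
# accuracy: 0.9203
# balanced_accuracy: 0.7902
# precision: 0.5855
# recall: 0.6285
# f1: 0.6063
```

With these test-set counts, a model that always predicts non-toxic would still reach about 0.90 accuracy (57735 of 63978 comments), which is why balanced accuracy, recall, and F1 are reported alongside it.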

## Training and Evaluating Models

We trained every combination of **classifier** and **embedding** and saved the results in a separate folder for each sampling method.

Each folder contains the saved models and a CSV file comparing the performance of every classifier with every embedding.

Links to saved models and results:
*   [Original](https://bit.ly/2yyOczB)
*   [Undersampled](https://bit.ly/3eBgbzc)
*   [Oversampled](https://bit.ly/3eDikug)

In [None]:
# Helper function to apply embedding to batches of data instead of whole data at a time

def batch_map(func, iterable, batch_size=1):
    from tqdm import tqdm
    l = len(iterable)
    result = list()
    for ndx in tqdm(range(0, l, batch_size)):
        result.extend(func(iterable[ndx:min(ndx + batch_size, l)]))
    return result
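A small usage sketch of the batching helper, with the progress bar left out so it runs without `tqdm`. The function `toy_embed` is a stand-in for a real embedding function: it returns one value per input and records the batch sizes it was called with.

```python
def batch_map(func, iterable, batch_size=1):
    """Apply func to consecutive slices of iterable and concatenate the results."""
    result = []
    for start in range(0, len(iterable), batch_size):
        result.extend(func(iterable[start:start + batch_size]))
    return result

seen_batches = []

def toy_embed(batch):
    # Stand-in for an embedding call: one output per input text
    seen_batches.append(len(batch))
    return [len(text) for text in batch]

lengths = batch_map(toy_embed, ["a", "bb", "ccc", "dddd", "eeeee"], batch_size=2)
print(lengths)       # [1, 2, 3, 4, 5]
print(seen_batches)  # [2, 2, 1]
```

The last batch is simply shorter than the others, so no padding is needed as long as the embedding function accepts variable batch sizes.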

In [None]:
from time import time
import pickle

with open(f'Models/Embed+SKLearn/{sampling}/models.csv', 'w') as logfile:
    logfile.write(f"timestamp,classifier,embedding,time_to_train,time_to_predict")
    for metric in metrics:
        logfile.write(f",{metric['name']}")
    logfile.write("\n")

for embedding in embeddings:
    print(f"Embedding: {embedding['name']}\n")
    
    # Embedding
    print(f"Processing embedding...", end="")
    embed = embedding['embed_func']()
    print("Done")
    print("Embedding train set...", end="")
    x_train = batch_map(embed, x_train_raw.reshape((-1)), batch_size=10000)
    print("Done")

    print("Embedding test set...", end="")
    x_test = batch_map(embed, x_test_raw.reshape((-1)), batch_size=10000)
    print("Done")

    for classifier in classifiers:
        # Initialize classifier
        print(f"\nClassifier: {classifier['name']}\n")
        clf = classifier['classifier']()

        # Train with timing
        print("Training...", end="")
        time_to_train = time()
        clf.fit(x_train, y_train)
        time_to_train = time() - time_to_train
        print(f"Done. Took {time_to_train}s")

        # Save model (in the folder of the active sampling method, not always "Original")
        print("Saving model...", end="")
        with open(f"Models/Embed+SKLearn/{sampling}/{embedding['code']}-{classifier['name']}.model", 'wb') as clf_file:
            pickle.dump(clf, clf_file)
        print("Done")

        # Predict with timing
        print("Predicting for test set...", end="")
        time_to_predict = time()
        prediction = clf.predict(x_test)
        time_to_predict = time() - time_to_predict
        print(f"Done. Took {time_to_predict}s")

        # Save logs (append to the same file the header was written to)
        print(f"Metrics:")
        with open(f'Models/Embed+SKLearn/{sampling}/models.csv', 'a') as logfile:
            logfile.write(f"{time()},{classifier['name']},{embedding['name']},{time_to_train},{time_to_predict}")
            for metric in metrics:
                score = metric['metric'](y_test, prediction)
                print(f"{metric['name']}: {score}")
                logfile.write(f",{score}")
            logfile.write("\n")

        del clf
        del prediction

    print("\n\n")
    del embed
    del x_train
    del x_test

Embedding: wiki-word-500

Processing embedding...Done
Embedding train set...
100%|██████████| 16/16 [00:03<00:00,  4.19it/s]
Done
Embedding test set...
100%|██████████| 7/7 [00:01<00:00,  4.86it/s]
Done

Classifier: LR

Training...Done. Took 10.429743766784668s
Saving model...Done
Predicting for test set...Done. Took 0.20615601539611816s
Metrics:
accuracy: 0.9203319891212605
balanced_accuracy: 0.7902137876885089
tn,fp,fn,tp: 54957,2778,2319,3924
precision: 0.585496866606983
recall: 0.628543969245555
f1: 0.6062572421784472

Classifier: SGD

Training...Done. Took 2.8988256454467773s
Saving model...Done
Predicting for test set...Done. Took 0.18036818504333496s
Metrics:
accuracy: 0.9260370752446153
balanced_accuracy: 0.7605172383872743
tn,fp,fn,tp: 55782,1953,2779,3464
precision: 0.6394683404098209
recall: 0.5548614448181963
f1: 0.5941680960548885

Classifier: Ridge

Training...Done. Took 1.51019287109375s
Saving model...Done
Predicting for test set...Done. Took 0.14039850234985352s
Metrics:
accuracy: 0.9094845103004158
balanced_accuracy: 0.5674152109338817
tn,fp,fn,tp: 57298,437,5354,889
precision: 0.6704374057315233
recall: 0.14239948742591702
f1: 0.23490553573787817

Classifier: L

Done
Embedding train set...
100%|██████████| 16/16 [00:02<00:00,  6.03it/s]
Done
Embedding test set...
100%|██████████| 7/7 [00:01<00:00,  6.67it/s]
Done

Classifier: LR

Training...Done. Took 3.1329996585845947s
Saving model...Done
Predicting for test set...Done. Took 0.08158206939697266s
Metrics:
accuracy: 0.9166119603613743
balanced_accuracy: 0.7580094215784083
tn,fp,fn,tp: 55141,2594,2741,3502
precision: 0.574475065616798
recall: 0.5609482620534999
f1: 0.5676310884188346

Classifier: SGD

Training...Done. Took 0.811194658279419s
Saving model...Done
Predicting for test set...Done. Took 0.05281496047973633s
Metrics:
accuracy: 0.9226765450623652
balanced_accuracy: 0.7277977637890265
tn,fp,fn,tp: 55999,1736,3211,3032
precision: 0.6359060402684564
recall: 0.4856639436168509
f1: 0.5507220052674598

Classifier: Ridge

Training...Done. Took 0.3179759979248047s
Saving model...Done
Predicting for test set...Done. Took 0.06239771842956543s
Metrics:
accuracy: 0.9075150833098877
balanced_accuracy: 0.5338236221294272
tn,fp,fn,tp: 57627,108,5809,434
precision: 0.8007380073800738
recall: 0.06951786000320359
f1: 0.12792925571112748

Classifier:

Done
Embedding train set...
100%|██████████| 16/16 [02:51<00:00, 10.75s/it]
Done
Embedding test set...
100%|██████████| 7/7 [01:30<00:00, 12.90s/it]
Done

Classifier: LR

Training...Done. Took 9.246548652648926s
Saving model...Done
Predicting for test set...Done. Took 0.20728802680969238s
Metrics:
accuracy: 0.9262246397199038
balanced_accuracy: 0.8117646422345847
tn,fp,fn,tp: 55078,2657,2063,4180
precision: 0.611379259909317
recall: 0.6695498958833894
f1: 0.6391437308868502

Classifier: SGD

Training...Done. Took 2.063481330871582s
Saving model...Done
Predicting for test set...Done. Took 0.17361140251159668s
Metrics:
accuracy: 0.9318515739785551
balanced_accuracy: 0.7773104484453088
tn,fp,fn,tp: 55964,1771,2589,3654
precision: 0.6735483870967742
recall: 0.5852955309947141
f1: 0.6263284196091875

Classifier: Ridge

Training...Done. Took 1.4535343647003174s
Saving model...Done
Predicting for test set...Done. Took 0.13072705268859863s
Metrics:
accuracy: 0.9317265309950296
balanced_accuracy: 0.7393121324722348
tn,fp,fn,tp: 56487,1248,3120,3123
precision: 0.7144818119423473
recall: 0.5002402691013935
f1: 0.5884680610514414

Classifier: 