# Text Classification on Hate Speech

---

__2nd Semester Data Science Master__  
__Beuth University of Applied Sciences Berlin__

__by Arndt, Ana, Christian, Ervin, Malte__ 


# Content
--- 

1. Preprocessing
1. Baseline
1. Pretrained Vectors
1. Ensembles
1. Recap

In [1]:
import numpy as np
import pandas as pd
from sklearn.ensemble import BaggingClassifier
import sklearn.metrics as skm
from os import chdir
from sklearn import model_selection
import skift
import warnings
warnings.filterwarnings('ignore')

In [10]:
chdir("/home/arndt/git-reps/hatespeech/")

df_test = pd.merge(pd.read_csv("data/test.csv"),
                   pd.read_csv("data/test_labels.csv"),
                   how="inner",
                   on="id")

df_train=pd.read_csv("data/data_unidecode.csv")

# data preparation

df_train=df_train[["id","comment_text","toxic"]]
df_train["label"]="__label__not_toxic"
df_train.loc[df_train["toxic"]==1,"label"]="__label__toxic"
df_train["comment_text"]=df_train["comment_text"].apply(str.replace,args=("\n"," "))
df_train["comment_text"]=df_train["comment_text"].apply(str.replace,args=("\"",""))

df_test["comment_text"]=df_test["comment_text"].apply(str.replace,args=("\n"," "))
df_test["comment_text"]=df_test["comment_text"].apply(str.replace,args=("\"",""))

X_train = pd.DataFrame(df_train.loc[:,"comment_text"])
y_train = df_train.loc[:,"toxic"]
#-1 means the row wasn't used for scoring in the Kaggle competition
X_test = pd.DataFrame(df_test[df_test["toxic"]>-1].loc[:,"comment_text"]) 
y_test = df_test[df_test["toxic"]>-1].loc[:,"toxic"]

# Data Preprocessing and Normalization  
---  
Improve the performance of the model by applying some simple preprocessing:  
- Translation into English 
- Only ASCII characters (unidecode)
- Remove special characters 
- Change Emojis to words

# Google Translation API Requests
---

![Dashboard](google_dashboard.png)

# Replace Characters  
---  
```py 
def replacetext(text):
    for key, value in REPLACE_TO.items():
        text = text.replace(key, value)
    return text
```

```py 
REPLACE_TO = { 
':)':'happy' , ':(':'sad', ':P':'funny' ,  
'@':'at' , '&':'and' , 'i\'m':'i am' , 'don\'t':'do not' , 'can\'t':'can not' ,   
'.':'' , ',':'' , ':':'', ';':'' , '!':'' , '\'':' ' , '?':' ' , '(':' ', ')':' ' , '[':' ' , ']':' ' , '-':' ' , '#':' ' , '=':' ' , '+':' ' , '/':' ' , '"':' ' ,   
'0':' zero ' , '1':' one ' , '2':' two ' , '3':' three ' , '4':' four ' , '5':' five ' ,'6' :' six ' , '7':' seven ' , '8':' eight ' , '9':' nine ' }
```
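
Because `replacetext` applies the entries in dictionary order (insertion order in Python 3.7+), multi-character patterns such as `:)` or `don't` have to come before the single characters they contain. A minimal sketch with a hypothetical four-entry subset of the table (`SMALL_TABLE`, `REVERSED_TABLE` and `replace_with` are illustrative names, not part of the notebook):

```python
SMALL_TABLE = {
    ":)": "happy",        # emoticon before the bare colon
    "don't": "do not",    # contraction before the bare apostrophe
    ":": "",
    "'": " ",
}

def replace_with(text, table):
    # same loop as replacetext, but with the table passed in explicitly
    for key, value in table.items():
        text = text.replace(key, value)
    return text

# same entries, but with the single characters first
REVERSED_TABLE = dict(reversed(list(SMALL_TABLE.items())))

sample = "don't worry :)"
print(replace_with(sample, SMALL_TABLE))     # do not worry happy
print(replace_with(sample, REVERSED_TABLE))  # don t worry )
```

With the reversed order, the apostrophe and colon are removed first, so the longer patterns never match.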

# Scoring function

In [11]:
def score_preds(y_true, y_pred):
    print("confusion matrix:")
    print(str(skm.confusion_matrix(y_true, y_pred)))
    print("classification report:")
    print(str(skm.classification_report(y_true, y_pred)))
    print("f1 macro: %0.4f" % (skm.precision_recall_fscore_support(y_true, y_pred, average='macro')[2]))
    print("f1 micro: %0.4f" % (skm.precision_recall_fscore_support(y_true, y_pred, average='micro')[2]))

# Majority class classifier


In [29]:
score_preds(y_test, np.zeros(y_test.shape)) #set all predictions to non-toxic (0)

confusion matrix:
[[57888     0]
 [ 6090     0]]
classification report:
             precision    recall  f1-score   support

          0       0.90      1.00      0.95     57888
          1       0.00      0.00      0.00      6090

avg / total       0.82      0.90      0.86     63978

f1 macro: 0.4750
f1 micro: 0.9048


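* By assigning every prediction to the majority class we already get a micro F1 score (equal to accuracy here) of about 90%, while the macro F1 score is only 0.475.
* This is because the test dataset is imbalanced and contains only about 10% toxic comments (6,090 of 63,978).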
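
Both scores in the output can be reproduced by hand from the confusion matrix, which shows why micro F1 flatters the majority-class classifier while macro F1 does not. A small sketch in plain Python (the helper name `f1` is ours, and setting F1 to 0 when it is undefined is an assumption that matches the printed report):

```python
# counts from the confusion matrix above: rows = true class, columns = predicted class
tn, fp = 57888, 0   # non-toxic comments
fn, tp = 6090, 0    # toxic comments

def f1(precision, recall):
    # harmonic mean of precision and recall; set to 0 when both are 0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# class 0 (non-toxic): every comment is predicted as class 0
f1_non_toxic = f1(precision=tn / (tn + fn), recall=tn / (tn + fp))
# class 1 (toxic): nothing is ever predicted as class 1, so precision is set to 0
f1_toxic = f1(precision=0.0, recall=tp / (tp + fn))

macro_f1 = (f1_non_toxic + f1_toxic) / 2     # unweighted mean over the classes
micro_f1 = (tn + tp) / (tn + fp + fn + tp)   # equals accuracy for single-label data

print("f1 macro: %0.4f" % macro_f1)
print("f1 micro: %0.4f" % micro_f1)
```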

# Single skift model

skift stands for scikit-fasttext: scikit-learn wrappers for Python fastText ([GitHub](https://github.com/shaypal5/skift))

In [12]:
skift_clf = skift.FirstObjFtClassifier()
skift_clf.fit(X_train, y_train)

preds = skift_clf.predict(X_test)
score_preds(y_test, preds)

print("f1 micro on training data: %0.4f" % (skift_clf.score(X_train, y_train))) #overfitted?

confusion matrix:
[[55268  2620]
 [ 2152  3938]]
classification report:
             precision    recall  f1-score   support

          0       0.96      0.95      0.96     57888
          1       0.60      0.65      0.62      6090

avg / total       0.93      0.93      0.93     63978

f1 macro: 0.7907
f1 micro: 0.9254
f1 micro on training data: 0.9745


# Single skift model - pretrained vectors

fastText English Word Vectors trained on Wikipedia 2017, UMBC webbase corpus, and statmt.org

In [27]:
skift_clf = skift.FirstObjFtClassifier(minn=3, maxn=3, pretrainedVectors="data/wiki-news-300d-1M.vec")
skift_clf.fit(X_train, y_train)

preds = skift_clf.predict(X_test)
score_preds(y_test, preds)

print("f1 micro on training data: %0.4f" % (skift_clf.score(X_train, y_train))) #overfitted?

score on test data: 0.9294
score on training data: 0.9676
confusion matrix:
[[54933  2955]
 [ 1564  4526]]
classification report:
             precision    recall  f1-score   support

          0       0.97      0.95      0.96     57888
          1       0.60      0.74      0.67      6090

avg / total       0.94      0.93      0.93     63978

f1 macro: 0.8138
f1 micro: 0.9294


In [20]:
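# Counts used here: 54771 non-toxic and 4590 toxic comments classified correctly,
# 3117 non-toxic comments flagged as toxic, 1500 toxic comments missed.
# (They differ slightly from the matrix printed above, apparently from another run.)
### For non-toxic ###
# Precision = TP/(TP+FP)
print(54771/(54771+1500))
# Recall = TP/(TP+FN)
print(54771/(54771+3117))

### For toxic ###
# Recall = TP/(TP+FN)
print(4590/(4590+1500))
# Precision = TP/(TP+FP)
print(4590/(4590+3117))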

0.9733432851735352
0.9461546434494196
0.7536945812807881
0.5955624756714675
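
The same quantities can be read off a confusion matrix without typing the counts by hand. The sketch below uses the matrix printed for the pretrained-vector model; `per_class_precision_recall` is a hypothetical helper, and it assumes the scikit-learn layout (rows are true classes, columns are predicted classes).

```python
import numpy as np

def per_class_precision_recall(cm):
    """Return (precision, recall) arrays with one entry per class."""
    cm = np.asarray(cm, dtype=float)
    true_positives = np.diag(cm)
    precision = true_positives / cm.sum(axis=0)  # column sums = predicted counts per class
    recall = true_positives / cm.sum(axis=1)     # row sums = true counts per class
    return precision, recall

cm = [[54933, 2955],
      [1564, 4526]]
precision, recall = per_class_precision_recall(cm)
for label, p, r in zip(["non-toxic", "toxic"], precision, recall):
    print("%-9s precision=%0.2f recall=%0.2f" % (label, p, r))
```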


# Check common errors - false negatives

What do the false negatives have in common? Checking common errors:

In [13]:
comp = pd.merge(pd.DataFrame({"comment_text" : X_test["comment_text"].values, "toxic_pred" : preds}),
                pd.concat([X_test, y_test], axis=1))

In [18]:
false_negatives = comp[(comp["toxic"]==1) & (comp["toxic_pred"]==0)]
print(false_negatives.shape)
false_negatives["comment_text"].to_csv("data/false_negatives.txt")
print(false_negatives.head())

(2152, 3)
                                          comment_text  toxic_pred  toxic
8    == Arabs are committing genocide in Iraq, but ...         0.0      1
29                :Fuck off, you anti-semitic cunt.  |         0.0      1
38   How dare you vandalize that page about the HMS...         0.0      1
127  :::::::::Moi? Ego? I am mortified that you cou...         0.0      1
133      So, on the tenth anniversary of 9/11, New ...         0.0      1


# Check common errors - false positives

What do the false positives have in common? Checking common errors:

In [15]:
false_positives = comp[(comp["toxic"]==0) & (comp["toxic_pred"]==1)]
print(false_positives.shape)
false_positives["comment_text"].to_csv("data/false_positives.txt")
print(false_positives.head())

(2620, 3)
                                          comment_text  toxic_pred  toxic
2    ::: Somebody will invariably try to add Religi...         1.0      0
27   I WILL BURN YOU TO HELL IF YOU REVOKE MY TALK ...         1.0      0
79                           WHAT THE HELL      Justin         1.0      0
99                               and lewd sex in China         1.0      0
148      ::: You have my trust. But trust me on thi...         1.0      0


# Ensemble predictions

Make predictions with a list of classifiers on a dataframe X.

In [10]:
def ensemble_predict_proba(classifiers, X):
    proba = [classifier.predict_proba(X) for classifier in classifiers]
    mean = np.zeros(proba[0].shape)
    for i in range(len(classifiers)):
        mean = mean + proba[i]
    mean = mean / float(len(classifiers))
    return mean

def ensemble_predict(classifiers, X):
    kfold_proba = ensemble_predict_proba(classifiers, X)
    kfold_labels = np.zeros(kfold_proba.shape[0]) #initialize array
    kfold_labels[kfold_proba[:,0]<=kfold_proba[:,1]] = 1
    return kfold_labels
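
To see what the averaging does, the sketch below passes three stub classifiers with fixed probabilities through the same soft-voting logic. `StubClassifier` and `soft_vote` are purely illustrative and not part of skift; the averaging is written with `np.mean`, which gives the same result as the explicit loop.

```python
import numpy as np

class StubClassifier:
    """Illustrative stand-in that returns fixed class probabilities."""
    def __init__(self, proba):
        self.proba = np.asarray(proba, dtype=float)

    def predict_proba(self, X):
        return self.proba

def soft_vote(classifiers, X):
    # average the [P(not toxic), P(toxic)] rows over all classifiers
    mean_proba = np.mean([clf.predict_proba(X) for clf in classifiers], axis=0)
    # label 1 (toxic) when P(toxic) >= P(not toxic), mirroring ensemble_predict
    labels = (mean_proba[:, 1] >= mean_proba[:, 0]).astype(float)
    return mean_proba, labels

clfs = [
    StubClassifier([[0.9, 0.1], [0.4, 0.6]]),
    StubClassifier([[0.8, 0.2], [0.6, 0.4]]),  # disagrees on the second sample
    StubClassifier([[0.7, 0.3], [0.2, 0.8]]),
]
mean_proba, labels = soft_vote(clfs, X=None)
print(mean_proba)
print(labels)
```

The second stub votes "not toxic" for the second sample, but the averaged probability still favours "toxic", so confident models can outweigh an uncertain one.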

# Build multiple models using K-Folds

In [None]:
seed = 77
kfold = model_selection.KFold(n_splits=5, shuffle=True, random_state=seed) #add variance through randomization

X = pd.DataFrame(df_train.loc[:,"comment_text"])
y = df_train.loc[:,"toxic"]

# build multiple models using k folds:
kfold_clfs = list()
for train_index, test_index in kfold.split(X):
    clf = skift.FirstObjFtClassifier(minn=3, maxn=3, pretrainedVectors="data/wiki-news-300d-1M.vec")
    clf.fit(X.iloc[train_index], y.iloc[train_index])
    clf.model.quantize()
    print("Score on test portion of this fold: %0.3f" % (clf.score(X.iloc[test_index], y.iloc[test_index])))
    kfold_clfs.append(clf)

In [None]:
score_preds(y_test, ensemble_predict(kfold_clfs, X_test))

# StratifiedKFold

* There are different strategies for splitting your data into a train set and a test set.
* If you want each fold to keep the same class percentages as the full dataset, use a stratified split.
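
The effect of stratification is easy to check on synthetic labels. The sketch below assumes a made-up label vector with 10% positives (roughly the imbalance of this dataset) and counts the positives in each test fold; `positives_per_fold` is a hypothetical helper.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# synthetic labels: 100 positives, 900 negatives
y = np.array([1] * 100 + [0] * 900)
X = np.zeros((len(y), 1))  # the features are irrelevant for the split itself

def positives_per_fold(splitter):
    # number of positive labels that land in each test fold
    return [int(y[test_idx].sum()) for _, test_idx in splitter.split(X, y)]

plain = positives_per_fold(KFold(n_splits=5, shuffle=True, random_state=77))
stratified = positives_per_fold(StratifiedKFold(n_splits=5, shuffle=True, random_state=77))

print("KFold:          ", plain)       # counts may differ between folds
print("StratifiedKFold:", stratified)  # 20 positives in every fold
```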

In [None]:
seed = 77
stkfold = model_selection.StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)

X = pd.DataFrame(df_train.loc[:,"comment_text"])
y = df_train.loc[:,"toxic"]

# build multiple models using stratified k folds:
stkfold_clfs = list()
for train_index, test_index in stkfold.split(X, y):
    clf = skift.FirstObjFtClassifier(minn=3, maxn=3, pretrainedVectors="data/wiki-news-300d-1M.vec")
    clf.fit(X.iloc[train_index], y.iloc[train_index])
    clf.model.quantize()
    print("Score on test portion of this fold: %0.3f" % (clf.score(X.iloc[test_index], y.iloc[test_index])))
    stkfold_clfs.append(clf)

In [None]:
score_preds(y_test, ensemble_predict(stkfold_clfs, X_test))

## KFold Conclusion

The main reason we started to use KFold was that we didn't have the labeled test data at the beginning. But after we found the real test data on Kaggle, we used it.

* k=10 slightly improved the score on the test set
* k=5 scored worse than just a single model on all the training data 
* StratifiedKFold performed worse than plain KFold.

I would not balance the data within the folds, as the data will not be balanced in a real-world example. With balanced folds, the cross-validation score would not represent the model performance well.

Some ways to deal with imbalanced data are under- and over-sampling (e.g. SMOTE).

# Oversampling (the minority class)

In [None]:
from collections import Counter
print(sorted(Counter(y_train).items()))

# TODO: 
* Oversampling simply with df.sample()
* use higher fraction of df[df["toxic"]==1]
* use lower fraction of df[df["toxic"]==0]
* Bagging on some models
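
The first three items could look roughly like the sketch below. It runs on a made-up toy frame, and the fractions `2.0` and `0.5` are placeholder values rather than tuned settings.

```python
import pandas as pd

# toy frame: 2 toxic and 8 non-toxic comments
df = pd.DataFrame({
    "comment_text": ["comment %d" % i for i in range(10)],
    "toxic": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
})

toxic = df[df["toxic"] == 1]
non_toxic = df[df["toxic"] == 0]

# oversample the minority class: frac > 1 requires replace=True
toxic_over = toxic.sample(frac=2.0, replace=True, random_state=77)
# undersample the majority class: frac < 1 draws without replacement
non_toxic_under = non_toxic.sample(frac=0.5, random_state=77)

# concatenate and shuffle so the classes are not grouped together
df_resampled = pd.concat([toxic_over, non_toxic_under]).sample(frac=1.0, random_state=77)
print(df_resampled["toxic"].value_counts().to_dict())
```

Resampling of this kind should only be applied to the training data; the test set keeps its original class distribution so that the scores stay comparable.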

In [None]:

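# select the toxic rows via y_train (X_train has no "toxic" column) and
# oversample them with replacement; frac=2.0 is only a placeholder value
X_resampled = X_train[y_train == 1].sample(frac=2.0, replace=True, random_state=77)
y_resampled = y_train.loc[X_resampled.index]
print(sorted(Counter(y_resampled).items()))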