# Text Classification on Hate Speech

---

__2nd Semester Data Science Master__  
__Beuth University of Applied Sciences Berlin__

__by Arndt, Ana, Christian, Ervin, Malte__ 


# Content
--- 

1. Preprocessing
1. Baseline
1. Improvements
1. Ensembles
1. Recap

In [1]:
import numpy as np
import pandas as pd
from sklearn.ensemble import BaggingClassifier
import sklearn.metrics as skm
from os import chdir
from sklearn import model_selection
import skift
import warnings
warnings.filterwarnings('ignore')

In [2]:
chdir("/home/arndt/git-reps/hatespeech/")

def load_train_test_data(train, test, test_labels, classname):

    df_test = pd.merge(pd.read_csv(test),
                       pd.read_csv(test_labels),
                       how="inner",
                       on="id")

    df_train=pd.read_csv(train)

    # data preparation

    df_train=df_train[["id","comment_text",classname]]
    df_train["comment_text"]=df_train["comment_text"].apply(str.replace,args=("\n"," "))
    df_train["comment_text"]=df_train["comment_text"].apply(str.replace,args=("\"",""))

    df_test["comment_text"]=df_test["comment_text"].apply(str.replace,args=("\n"," "))
    df_test["comment_text"]=df_test["comment_text"].apply(str.replace,args=("\"",""))

    X_train = pd.DataFrame(df_train.loc[:,"comment_text"])
    y_train = df_train.loc[:,classname]
    #-1 means the row wasn't used for scoring in the Kaggle competition
    X_test = pd.DataFrame(df_test[df_test[classname]>-1].loc[:,"comment_text"]) 
    y_test = df_test[df_test[classname]>-1].loc[:,classname]
    
    return X_train, X_test, y_train, y_test

def score_preds(y_true, y_pred):
    print("confusion matrix:")
    print(str(skm.confusion_matrix(y_true, y_pred)))
    print("classification report:")
    print(str(skm.classification_report(y_true, y_pred)))
    print("f1 macro: %0.4f" % (skm.precision_recall_fscore_support(y_true, y_pred, average='macro')[2]))
    print("f1 micro: %0.4f" % (skm.precision_recall_fscore_support(y_true, y_pred, average='micro')[2]))

# Data Preprocessing and Normalization  
---  
Improve model performance by applying some simple preprocessing steps:  
- Translation into English 
- Only ASCII characters (unidecode)
- Remove special characters 
- Change Emojis to words
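
The steps above (except translation) can be sketched roughly as follows. This is a minimal illustration, not the script that produced `train_unidecode.csv`: it uses the standard library's `unicodedata` as a stand-in for `unidecode` (which transliterates far more characters), and the `EMOJI_WORDS` table and the name `normalize_comment` are hypothetical.

```python
import re
import unicodedata

# Hypothetical emoji-to-word table, for illustration only.
EMOJI_WORDS = {"\U0001F600": " happy ", "\U0001F622": " sad "}

def normalize_comment(text):
    # 1. Replace emojis by words before non-ASCII characters are dropped.
    for emoji, word in EMOJI_WORDS.items():
        text = text.replace(emoji, word)
    # 2. Keep ASCII only: decompose accented characters, then drop
    #    everything that is still non-ASCII (stand-in for unidecode).
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # 3. Replace special characters by spaces and collapse whitespace.
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalize_comment("Café déjà vu \U0001F600!!"))
```

The emoji replacement has to run first, because the ASCII step would otherwise silently delete the emoji and the signal it carries.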

# Google Translation API Requests
---

![Dashboard](images/google_dashboard.png)

# Replace Characters  
---  
```py 
def replacetext(text):
    for key, value in REPLACE_TO.items():
        text = text.replace(key, value)
    return text
```

```py 
REPLACE_TO = {
   ':)': 'happy',':(': 'sad',':P': 'funny','@': 'at','&': 'and','i\'m': 'i am','don\'t': 'do not','can\'t': 'can not',
   '.': ' ',',': ' ',':': ' ', ';': ' ','!': ' ','\'': ' ','?': ' ', '(': ' ', ')': ' ', '[': ' ', ']': ' ', '-': ' ',
   '#': ' ', '=': ' ', '+': ' ', '/': ' ', '"': ' ', '0': ' zero ', '1': ' one ', '2': ' two ', '3': ' three ',
   '4': ' four ','5': ' five ', '6': ' six ', '7': ' seven ', '8': ' eight ', '9': ' nine ',
   '|':' ', '$':' ', '%':' ', '^':' ', '*':' ', '_':' ', '{':' ', '}':' ', '<':' ', '>':' ',
   '\n': ' '
}
```
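
Because `replacetext` applies the entries one after another, the order of the table matters: multi-character patterns such as `:)` have to come before the single characters `:` and `)`, otherwise they can never match. Dictionaries keep their insertion order in Python 3.7+, so the table above behaves as intended. A small sketch with two reduced, hypothetical tables shows the effect (`replace_all` is an illustrative name):

```python
def replace_all(text, table):
    # Apply the replacements in the table's insertion order.
    for key, value in table.items():
        text = text.replace(key, value)
    return text

# Smiley first: it is turned into a word before ':' and ')' are blanked.
smiley_first = {":)": "happy", ":": " ", ")": " "}
# Single characters first: ':)' is destroyed before it can match.
chars_first = {":": " ", ")": " ", ":)": "happy"}

print(repr(replace_all("nice :)", smiley_first)))
print(repr(replace_all("nice :)", chars_first)))
```

With the second table the smiley is reduced to whitespace and its sentiment is lost.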

# Majority class classifier


In [29]:
score_preds(y_test, np.zeros(y_test.shape)) #set all predictions to non-toxic (0)

confusion matrix:
[[57888     0]
 [ 6090     0]]
classification report:
             precision    recall  f1-score   support

          0       0.90      1.00      0.95     57888
          1       0.00      0.00      0.00      6090

avg / total       0.82      0.90      0.86     63978

f1 macro: 0.4750
f1 micro: 0.9048


* By simply assigning every prediction to the majority class, we already get a micro F1 score (equal to accuracy here) of 90%, while the macro F1 score is only 0.475.
* This is because the test dataset is imbalanced and contains only about 10% toxic comments.
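
To make the difference between the two averages concrete, both scores can be recomputed from the confusion matrix alone. The following is a minimal pure-Python sketch (the helper name `f1_scores` is hypothetical; `score_preds` above uses scikit-learn instead):

```python
def f1_scores(confusion):
    """Return (macro F1, micro F1) for a square confusion matrix
    with rows = true classes and columns = predicted classes."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    per_class = []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                       # true k, predicted other
        fp = sum(confusion[r][k] for r in range(n)) - tp  # predicted k, true other
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
    macro = sum(per_class) / n
    # With exactly one label per sample, micro F1 equals accuracy.
    micro = sum(confusion[k][k] for k in range(n)) / total
    return macro, micro

macro, micro = f1_scores([[57888, 0], [6090, 0]])
print("f1 macro: %0.4f" % macro)
print("f1 micro: %0.4f" % micro)
```

The macro average weights both classes equally, so the F1 score of 0 for the toxic class halves the result, whereas the micro average is dominated by the majority class.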

# Single model - baseline

Using skift (scikit-fasttext), a set of scikit-learn wrappers for Python fastText ([GitHub](https://github.com/shaypal5/skift))

In [41]:
X_train, X_test, y_train, y_test = load_train_test_data("data/train.csv", "data/test.csv", "data/test_labels.csv", "toxic")

skift_clf = skift.FirstObjFtClassifier()
skift_clf.fit(X_train, y_train)
preds = skift_clf.predict(X_test)
score_preds(y_test, preds)
print("f1 micro on training data: %0.4f" % (skift_clf.score(X_train, y_train)))

confusion matrix:
[[54325  3563]
 [ 1218  4872]]
classification report:
             precision    recall  f1-score   support

          0       0.98      0.94      0.96     57888
          1       0.58      0.80      0.67      6090

avg / total       0.94      0.93      0.93     63978

f1 macro: 0.8143
f1 micro: 0.9253
f1 micro on training data: 0.9722


# Single model - preprocessed data

In [4]:
X_train, X_test, y_train, y_test = load_train_test_data("data/train_unidecode.csv", "data/test_unidecode.csv", "data/test_labels.csv", "toxic")

skift_clf = skift.FirstObjFtClassifier()
skift_clf.fit(X_train, y_train)
preds = skift_clf.predict(X_test)
score_preds(y_test, preds)
print("f1 micro on training data: %0.4f" % (skift_clf.score(X_train, y_train)))

confusion matrix:
[[54324  3564]
 [ 1222  4868]]
classification report:
             precision    recall  f1-score   support

          0       0.98      0.94      0.96     57888
          1       0.58      0.80      0.67      6090

avg / total       0.94      0.93      0.93     63978

f1 macro: 0.8141
f1 micro: 0.9252
f1 micro on training data: 0.9722


# Single model - pretrained vectors

fastText English Word Vectors trained on Wikipedia 2017, UMBC webbase corpus, and statmt.org

**There is an issue with the `pretrainedVectors` parameter: it is being ignored.**

In [35]:
skift_clf = skift.FirstObjFtClassifier(minn=3, maxn=3, pretrainedVectors="vectors/wiki-news-300d-1M-subword.vec")
skift_clf.fit(X_train, y_train)
preds = skift_clf.predict(X_test)
score_preds(y_test, preds)
print("f1 micro on training data: %0.4f" % (skift_clf.score(X_train, y_train)))

confusion matrix:
[[54738  3150]
 [ 1495  4595]]
classification report:
             precision    recall  f1-score   support

          0       0.97      0.95      0.96     57888
          1       0.59      0.75      0.66      6090

avg / total       0.94      0.93      0.93     63978

f1 macro: 0.8118
f1 micro: 0.9274
f1 micro on training data: 0.9684


# Single model - parameters

In [3]:
X_train, X_test, y_train, y_test = load_train_test_data("data/train_unidecode.csv", "data/test_unidecode.csv", "data/test_labels.csv", "toxic")

skift_clf = skift.FirstObjFtClassifier(wordNgrams=2, maxn=3, dim=300)
skift_clf.fit(X_train, y_train)
preds = skift_clf.predict(X_test)
score_preds(y_test, preds)
print("f1 micro on training data: %0.4f" % (skift_clf.score(X_train, y_train)))

confusion matrix:
[[55608  2280]
 [ 1939  4151]]
classification report:
             precision    recall  f1-score   support

          0       0.97      0.96      0.96     57888
          1       0.65      0.68      0.66      6090

avg / total       0.94      0.93      0.93     63978

f1 macro: 0.8132
f1 micro: 0.9341
f1 micro on training data: 0.9586


In [20]:
### For non-toxic ###
# Precision = TP/(TP+FP)
print(54771/(54771+1500))
# Recall = TP/(TP+FN)
print(54771/(54771+3117))

### For toxic ###
# Recall = TP/(TP+FN)
print(4590/(4590+1500))
# Precision = TP/(TP+FP)
print(4590/(4590+3117))

0.9733432851735352
0.9461546434494196
0.7536945812807881
0.5955624756714675


# Check common errors - false negatives

What do the false negatives have in common? Check common errors …

In [42]:
comp = pd.merge(pd.DataFrame({"comment_text" : X_test["comment_text"].values, "toxic_pred" : preds}),
                pd.concat([X_test, y_test], axis=1))

In [43]:
false_negatives = comp[(comp["toxic"]==1) & (comp["toxic_pred"]==0)]
print(false_negatives.shape)
false_negatives["comment_text"].to_csv("data/false_negatives.txt")
print(false_negatives.head())

(1218, 3)
                                          comment_text  toxic_pred  toxic
8       arabs are committing genocide in iraq  but ...         0.0      1
106      well  it sucks to have a university to be ...         0.0      1
133       so  on the tenth anniversary of  nine   o...         0.0      1
195           hey shithead  stop vandilizing articles          0.0      1
279                   karl tearle is a mop haired twat         0.0      1


## False negatives: labeled as toxic, not identified as toxic

- "well  it **sucks** to have a university to be nicknameless  and it s the first time in ncaa history that it has happened"
- "   intolerance in india        india is a generator of liars like you"
- "look  you re a pedant  and fetzer is a jew hater  on press tv in the uk today   september  two nd  two  zero  one  one    he said that the israelis were behind  nine   one  one   the man is a complete fool "
- "not even every sexual person fantasizes while masturbating  most males do  but many females do not   i think most libidinous asexuals masturbate for the same reason they would scratch themselves if they were itchy "


- **few or no predominantly abusive-use words in a "normal speech context"**
- **many would actually not label these comments as toxic**
- **could argue that the classifier actually does a good job**

### but:
- "hey **shithead**  stop vandilizing articles " 

# Check common errors - false positives

What do the false positives have in common? Check common errors …

In [44]:
false_positives = comp[(comp["toxic"]==0) & (comp["toxic_pred"]==1)]
print(false_positives.shape)
false_positives["comment_text"].to_csv("data/false_positives.txt")
print(false_positives.head())

(3563, 3)
                                         comment_text  toxic_pred  toxic
1                     dear god this site is horrible          1.0      0
27  i will burn you to hell if you revoke my talk ...         1.0      0
78          shameless canvass       hello  diannaa...         1.0      0
79                          what the hell      justin         1.0      0
87      buffoon synonyms     bozo  buffo  clown  c...         1.0      0


## False Positives: labeled as non-toxic, identified as toxic

- "i will burn you to hell if you revoke my talk page access" - **wrongly labeled**
- "  buffoon synonyms     bozo  buffo  clown  comedian  comic  fool  harlequin  humorist  idiot  jerk  jester  joker  merry andrew  mime  mimic  mummer  playboy  prankster  ridicule  stooge  wag  wit  zany " - **special context**
- " gay       he s gay too  it should be noted that he has a male partner " - **non-abusive use of a term that is predominantly used in an abusive way in the corpus** 

# Building fastText Ensembles
---

Make predictions with a collection of classifiers on a dataset X.

* K Fold
* Stratified K Fold
* Bagging
* Bagging with Oversampling

In [8]:
def ensemble_predict_proba(classifiers, X):
    proba = [classifier.predict_proba(X) for classifier in classifiers]
    mean = np.zeros(proba[0].shape)
    for i in range(len(classifiers)):
        mean = mean + proba[i]
    mean = mean / float(len(classifiers))
    return mean

def ensemble_predict(classifiers, X):
    kfold_proba = ensemble_predict_proba(classifiers, X)
    kfold_labels = np.zeros(kfold_proba.shape[0]) #initialize array
    kfold_labels[kfold_proba[:,0]<=kfold_proba[:,1]] = 1
    return kfold_labels
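
The two functions above implement what is often called soft voting: average the class probabilities of all classifiers, then pick the more probable class. A compact, self-contained sketch of the same idea follows. `ConstantProba` is a hypothetical stand-in for a fitted classifier, and unlike `ensemble_predict`, `argmax` resolves exact ties in favour of class 0 rather than class 1.

```python
import numpy as np

class ConstantProba:
    """Hypothetical stand-in for a fitted classifier with predict_proba."""
    def __init__(self, proba):
        self.proba = np.asarray(proba, dtype=float)

    def predict_proba(self, X):
        return self.proba

def soft_vote(classifiers, X):
    # Average the class probabilities over all classifiers ...
    mean_proba = np.mean([clf.predict_proba(X) for clf in classifiers], axis=0)
    # ... and pick the most probable class per row.
    return mean_proba, mean_proba.argmax(axis=1)

clfs = [
    ConstantProba([[0.9, 0.1], [0.4, 0.6]]),
    ConstantProba([[0.7, 0.3], [0.7, 0.3]]),
    ConstantProba([[0.8, 0.2], [0.2, 0.8]]),
]
mean_proba, labels = soft_vote(clfs, X=None)
print(mean_proba)
print(labels)
```

For the second row one classifier votes for class 0, but the averaged probability still favours class 1, which is the behaviour the ensemble relies on.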

# K Folds
---

![KFold](images/kfold.png)

# Build multiple models using K-Folds

In [12]:
X_train, X_test, y_train, y_test = load_train_test_data("data/train_unidecode.csv", "data/test_unidecode.csv", "data/test_labels.csv", "toxic")

kfold = model_selection.KFold(n_splits=10, shuffle=True) #add variance through randomization

# build multiple models using k folds:
kfold_clfs = list()
for train_index, test_index in kfold.split(X_train):
    clf = skift.FirstObjFtClassifier(minn=3, maxn=3, pretrainedVectors="data/wiki-news-300d-1M.vec")
    clf.fit(X_train.iloc[train_index], y_train.iloc[train_index])
    print("Score on test proportion of this fold: %0.3f" % (clf.score(X_train.iloc[test_index], y_train.iloc[test_index])))
    kfold_clfs.append(clf)

Score on test proportion of this fold: 0.961
Score on test proportion of this fold: 0.962
Score on test proportion of this fold: 0.958
Score on test proportion of this fold: 0.958
Score on test proportion of this fold: 0.961
Score on test proportion of this fold: 0.960
Score on test proportion of this fold: 0.960
Score on test proportion of this fold: 0.961
Score on test proportion of this fold: 0.961
Score on test proportion of this fold: 0.960


In [16]:
score_preds(y_test, ensemble_predict(kfold_clfs, X_test))

confusion matrix:
[[54887  3001]
 [ 1431  4659]]
classification report:
             precision    recall  f1-score   support

          0       0.97      0.95      0.96     57888
          1       0.61      0.77      0.68      6090

avg / total       0.94      0.93      0.93     63978

f1 macro: 0.8194
f1 micro: 0.9307


# StratifiedKFold

* There are different strategies for splitting your data into a train set and a test set.
* If you want each fold to keep the same class percentages as the full dataset, use a stratified split.
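
The idea behind a stratified split can be sketched in a few lines of pure Python. This is a simplified illustration (round-robin assignment per class, no shuffling), not scikit-learn's `StratifiedKFold` implementation; the function name and the toy labels are hypothetical.

```python
from collections import defaultdict

def stratified_folds(labels, n_splits):
    """Assign each sample index to a fold so that every fold keeps
    roughly the overall class proportions (round-robin per class)."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds

# Hypothetical toy data with 10% positives.
labels = [1] * 10 + [0] * 90
for fold in stratified_folds(labels, n_splits=5):
    positives = sum(labels[i] for i in fold)
    print("fold size: %d, positives: %d" % (len(fold), positives))
```

Every fold ends up with the same 10% share of positives, whereas a plain random split only matches that share on average.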

In [6]:
X_train, X_test, y_train, y_test = load_train_test_data("data/train_unidecode.csv", "data/test_unidecode.csv", "data/test_labels.csv", "toxic")

stkfold = model_selection.StratifiedKFold(n_splits=10, shuffle=True)

# build multiple models using k folds:
stkfold_clfs = list()
for train_index, test_index in stkfold.split(X=X_train, y=y_train):
    clf = skift.FirstObjFtClassifier(minn=3, maxn=3, pretrainedVectors="data/wiki-news-300d-1M.vec")
    clf.fit(X_train.iloc[train_index], y_train.iloc[train_index])
    print("Score on test proportion of this fold: %0.3f" % (clf.score(X_train.iloc[test_index], y_train.iloc[test_index])))
    stkfold_clfs.append(clf)

Score on test proportion of this fold: 0.959
Score on test proportion of this fold: 0.960
Score on test proportion of this fold: 0.961
Score on test proportion of this fold: 0.961
Score on test proportion of this fold: 0.959
Score on test proportion of this fold: 0.961
Score on test proportion of this fold: 0.960
Score on test proportion of this fold: 0.960
Score on test proportion of this fold: 0.962
Score on test proportion of this fold: 0.963


In [9]:
score_preds(y_test, ensemble_predict(stkfold_clfs, X_test))

confusion matrix:
[[54915  2973]
 [ 1436  4654]]
classification report:
             precision    recall  f1-score   support

          0       0.97      0.95      0.96     57888
          1       0.61      0.76      0.68      6090

avg / total       0.94      0.93      0.93     63978

f1 macro: 0.8200
f1 micro: 0.9311


## KFold Conclusion

We initially used KFold because we did not have labeled test data. Once we found the labeled test data on Kaggle, we evaluated on it instead.

* k=10 slightly improved the score on the test set
* k=5 scored worse than just a single model on all the training data 
* StratifiedKFold performed about the same as plain KFold (f1 macro 0.8200 vs. 0.8194 in the runs shown above)

I would not balance the data within the folds, as real-world data will not be balanced either; a cross-validation score computed on balanced folds would not represent the model's performance well.

Some ways to deal with imbalanced data are under- and oversampling (e.g. SMOTE).

![Bagging](images/bagging.png)

# Oversampling the minority class

In [10]:
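def oversample(X, y, p_oversample_size, p_oversample_ratio, seed=None):
    """Resample X and y with replacement.

    p_oversample_size: size of the resampled set relative to the original
    p_oversample_ratio: target share of positive (y==1) rows
    seed: optional random_state for reproducible sampling
    """
    y_true_idx = y[y==1].index
    y_false_idx = y[y==0].index

    # class shares in the original data
    true_frac = len(y_true_idx) / len(y)
    false_frac = len(y_false_idx) / len(y)
    # per-class sampling fractions that yield the requested size and ratio
    oversample_true_frac = p_oversample_size * p_oversample_ratio / true_frac
    oversample_false_frac = p_oversample_size * (1-p_oversample_ratio) / false_frac

    X_true = X.loc[y_true_idx].sample(frac=oversample_true_frac, replace=True, random_state=seed)
    X_false = X.loc[y_false_idx].sample(frac=oversample_false_frac, replace=True, random_state=seed)

    X_resampled = pd.concat([X_true, X_false])
    y_resampled = y.loc[X_resampled.index]
    return X_resampled, y_resampled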
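
The arithmetic behind the two sampling fractions is easier to follow with round numbers. The sketch below mirrors the formula used in `oversample`; the class counts are hypothetical and chosen only to keep the numbers readable, and `sampling_fractions` is an illustrative name.

```python
def sampling_fractions(n_pos, n_neg, size, ratio):
    """Per-class sampling fractions for a resampled set that is `size`
    times the original and contains a share `ratio` of positives."""
    n = n_pos + n_neg
    pos_frac = size * ratio / (n_pos / n)
    neg_frac = size * (1 - ratio) / (n_neg / n)
    return pos_frac, neg_frac

# Hypothetical class counts: 1000 rows, 10% positives.
n_pos, n_neg = 100, 900
pos_frac, neg_frac = sampling_fractions(n_pos, n_neg, size=1.25, ratio=0.08)
new_pos = round(pos_frac * n_pos)
new_neg = round(neg_frac * n_neg)
print(new_pos, new_neg, new_pos / (new_pos + new_neg))
```

The resampled set has 1.25 times the original number of rows and the requested share of positives. Note that a ratio below the original positive share enlarges the negative class rather than the positive one.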

In [None]:
from collections import Counter
print(sorted(Counter(y_train).items()))

# Bagging ensemble - Oversampling w/ seed 

In [37]:
X_train, X_test, y_train, y_test = load_train_test_data("data/train_unidecode.csv", "data/test_unidecode.csv", "data/test_labels.csv", "toxic")

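# Oversampling different fractions and score
seed = 42  # hypothetical value: the seed used for the output below is not recorded in this notebook
oversample_clfs = list()
for f in list(np.arange(0.08, 0.11, 0.0025)):
    X_resampled, y_resampled = oversample(X_train, y_train, 1.25, f, seed)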
    skift_clf = skift.FirstObjFtClassifier(lr=0.2)
    skift_clf.fit(X_resampled, y_resampled)
    print("oversampling fraction: %0.4f // score: %0.4f" % (f, skift_clf.score(X_test, y_test)))
    oversample_clfs.append(skift_clf)

score_preds(y_test, ensemble_predict(oversample_clfs, X_test))

oversampling fraction: 0.0800 // score: 0.9342
oversampling fraction: 0.0825 // score: 0.9351
oversampling fraction: 0.0850 // score: 0.9233
oversampling fraction: 0.0875 // score: 0.9355
oversampling fraction: 0.0900 // score: 0.9357
oversampling fraction: 0.0925 // score: 0.9353
oversampling fraction: 0.0950 // score: 0.9333
oversampling fraction: 0.0975 // score: 0.9141
oversampling fraction: 0.1000 // score: 0.9354
oversampling fraction: 0.1025 // score: 0.9192
oversampling fraction: 0.1050 // score: 0.9235
oversampling fraction: 0.1075 // score: 0.9196
confusion matrix:
[[56181  1707]
 [ 2477  3613]]
classification report:
             precision    recall  f1-score   support

          0       0.96      0.97      0.96     57888
          1       0.68      0.59      0.63      6090

avg / total       0.93      0.93      0.93     63978

f1 macro: 0.7987
f1 micro: 0.9346


# Bagging ensemble - Oversampling w/o seed - 96 bags

In [11]:
# Oversampling different fractions and score

oversample_clfs = list()

for i in range(8):
    for f in list(np.arange(0.08, 0.11, 0.0025)):
        X_resampled, y_resampled = oversample(X_train, y_train, 1.25, f)
        skift_clf = skift.FirstObjFtClassifier(lr=0.2)
        skift_clf.fit(X_resampled, y_resampled)
        skift_clf.model.quantize()
        print("oversampling fraction: %0.4f // score: %0.4f" % (f, skift_clf.score(X_test, y_test)))
        oversample_clfs.append(skift_clf)

score_preds(y_test, ensemble_predict(oversample_clfs, X_test))

oversampling fraction: 0.0800 // score: 0.9200
oversampling fraction: 0.0825 // score: 0.9362
oversampling fraction: 0.0850 // score: 0.9376
oversampling fraction: 0.0875 // score: 0.9179
oversampling fraction: 0.0900 // score: 0.9224
oversampling fraction: 0.0925 // score: 0.9356
oversampling fraction: 0.0950 // score: 0.9195
oversampling fraction: 0.0975 // score: 0.9128
oversampling fraction: 0.1000 // score: 0.9161
oversampling fraction: 0.1025 // score: 0.9117
oversampling fraction: 0.1050 // score: 0.9194
oversampling fraction: 0.1075 // score: 0.9133
oversampling fraction: 0.0800 // score: 0.9364
oversampling fraction: 0.0825 // score: 0.9238
oversampling fraction: 0.0850 // score: 0.9172
oversampling fraction: 0.0875 // score: 0.9376
oversampling fraction: 0.0900 // score: 0.9367
oversampling fraction: 0.0925 // score: 0.9196
oversampling fraction: 0.0950 // score: 0.9259
oversampling fraction: 0.0975 // score: 0.9382
oversampling fraction: 0.1000 // score: 0.9381
[output truncated]

# Result comparison - single models
**Single model, original data, default parameters**  
```
[[54325  3563]                        f1 macro: 0.8143
 [ 1218  4872]]                       f1 micro: 0.9253
                                      f1 micro on training data: 0.9722
```

**Single model, preprocessed data, default parameters**  
```
[[54324  3564]                        f1 macro: 0.8141
 [ 1222  4868]]                       f1 micro: 0.9252
                                      f1 micro on training data: 0.9722
```

**Single model, preprocessed data, wordNgrams=2, maxn=3, dim=300**  
```
[[55608  2280]                        f1 macro: 0.8132
 [ 1939  4151]]                       f1 micro: 0.9341
                                      f1 micro on training data: 0.9586
```

# Result comparison - ensembles
**10 Folds, preprocessed data, minn=3, maxn=3, wiki-news-300d-1M.vec**  
```
[[54887  3001]                        f1 macro: 0.8194
 [ 1431  4659]]                       f1 micro: 0.9307
```

**10 Stratified Folds, preprocessed data, minn=3, maxn=3, wiki-news-300d-1M.vec**  
```
[[54915  2973]                        f1 macro: 0.8200
 [ 1436  4654]]                       f1 micro: 0.9311
```

**Oversampling ensemble, preprocessed data, 12 bags, lr=0.2**  
```
[[56181  1707]                        f1 macro: 0.7987
 [ 2477  3613]]                       f1 micro: 0.9346
```

**Oversampling ensemble, preprocessed data, 96 bags, lr=0.2**  
```
[[54913  2975]                        f1 macro: 0.8195
 [ 1444  4646]]                       f1 micro: 0.9309
```


# Oversampling Ensemble Histogram  
---

![OversamplingHistogram.png](images/OversamplingHistogram.png)

# Identity hate

In [8]:
X_train, X_test, y_train, y_test = load_train_test_data("data/train_unidecode.csv", "data/test_unidecode.csv", "data/test_labels.csv", "identity_hate")

skift_clf = skift.FirstObjFtClassifier()
skift_clf.fit(X_train, y_train)
preds = skift_clf.predict(X_test)
score_preds(y_test, preds)
print("f1 micro on training data: %0.4f" % (skift_clf.score(X_train, y_train)))

confusion matrix:
[[63026   240]
 [  479   233]]
classification report:
             precision    recall  f1-score   support

          0       0.99      1.00      0.99     63266
          1       0.49      0.33      0.39       712

avg / total       0.99      0.99      0.99     63978

f1 macro: 0.6938
f1 micro: 0.9888
f1 micro on training data: 0.9934


## Conclusion

* Random results (due to the random initialization of the neural net's weights) make comparing results difficult
* Ensembles help to stabilize the results
* It remains unclear how some parameters, e.g. pretrained vectors, affect the score
* Usage within scikit-learn is difficult if you don't have numeric predictors 
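
One simple way to make comparisons less sensitive to this randomness is to report the mean and spread over several runs instead of a single number. As an illustration, the sketch below summarizes the ten fold scores printed in the K-Folds section (which vary with both the data split and the initialization):

```python
from statistics import mean, stdev

# Fold scores printed by the 10-fold run in the K-Folds section.
kfold_scores = [0.961, 0.962, 0.958, 0.958, 0.961,
                0.960, 0.960, 0.961, 0.961, 0.960]

print("mean: %0.4f, std: %0.4f" % (mean(kfold_scores), stdev(kfold_scores)))
```

Differences between two setups that are smaller than this spread should be read with caution.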

## More ideas

* GridSearch on "good" fastText hyperparameters
* Generate many models and persist one that scores high
* Continuously improve this persisted model
* More detailed comparison of the probabilities of FPs/FNs of different models
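
The grid search idea could be sketched as below. This is only an outline under stated assumptions: the candidate values are hypothetical, `grid_search` is an illustrative helper, and `dummy_evaluate` is a placeholder that returns made-up scores so the sketch runs on its own. In practice it would fit a classifier with the given parameters and score it on held-out data.

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Try every parameter combination and return the best one.
    `evaluate` maps a parameter dict to a validation score (higher is better)."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical candidate values; the parameter names are the ones used above.
param_grid = {"wordNgrams": [1, 2], "dim": [100, 300], "lr": [0.1, 0.2]}

def dummy_evaluate(params):
    # Placeholder score, NOT a real measurement. A real version would fit
    # skift.FirstObjFtClassifier(**params) and score it on validation data.
    return -abs(params["lr"] - 0.2) - abs(params["dim"] - 300) / 1000

best_params, best_score = grid_search(param_grid, dummy_evaluate)
print(best_params)
```

Because single fastText runs are noisy, each combination should ideally be scored over several runs before one is declared the winner.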


# Tfidf Method using Scikit vectorizer

This method was inspired by a Kaggle competitor who used scikit-learn to implement a logistic regression on word and character n-gram TF-IDF features. Notably, this approach achieved a better score without using fastText at all.

Source: https://www.kaggle.com/tunguz/logistic-regression-with-words-and-char-n-grams/code 

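A few edits were made to produce the following results. Note that `cross_val_score` refits the classifier on folds of the labeled test set, so these are 3-fold cross-validation scores on the test data:
* Test CV score (ROC AUC) for class toxic is 0.957
* Test CV score (ROC AUC) for class identity_hate is 0.975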
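
ROC AUC is a ranking metric and therefore not directly comparable to the F1 scores reported for the fastText models. It can be read as the probability that a randomly chosen positive comment receives a higher score than a randomly chosen negative one. A minimal sketch of that definition follows; it is quadratic in the number of samples, for illustration only, and the labels and scores are hypothetical.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the probability that a randomly chosen positive gets
    a higher score than a randomly chosen negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, scores))
```

Since only the ordering of the scores matters, a model can reach a high ROC AUC on imbalanced data and still have a modest F1 score at a fixed decision threshold.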


In [4]:
import numpy as np
import pandas as pd
from os import chdir, path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from scipy.sparse import hstack

class_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
chdir(path.dirname(path.abspath('__file__')))
train = pd.read_csv('data/train.csv').fillna(' ')
test = pd.read_csv('data/test.csv').fillna(' ')

df_test = pd.merge(pd.read_csv("data/test.csv"),
                   pd.read_csv("data/test_labels.csv"),
                   how="inner",
                   on="id")

#y_test = df_test[df_test["toxic"]>-1].loc[:,"toxic"]
train_text = train['comment_text']
test_text = pd.DataFrame(df_test[df_test["toxic"]>-1].loc[:,"comment_text"])["comment_text"]
all_text = pd.concat([train_text, test_text])

word_vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents='unicode',
    analyzer='word',
    token_pattern=r'\w{1,}',
    stop_words='english',
    ngram_range=(1, 1),
    max_features=10000)
word_vectorizer.fit(all_text)
train_word_features = word_vectorizer.transform(train_text)
test_word_features = word_vectorizer.transform(test_text)

char_vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents='unicode',
    analyzer='char',
    stop_words='english',
    ngram_range=(2, 6),
    max_features=50000)
char_vectorizer.fit(all_text)
train_char_features = char_vectorizer.transform(train_text)
test_char_features = char_vectorizer.transform(test_text)

train_features = hstack([train_char_features, train_word_features])
test_features = hstack([test_char_features, test_word_features])

scores = []
scores_test = []
submission = pd.DataFrame.from_dict({'id': test['id']})
for class_name in ["toxic"]:

    train_target = train[class_name]
    test_target = df_test[df_test[class_name]>-1].loc[:,"toxic"]

    classifier_test = LogisticRegression(C=0.1, solver='sag')
    classifier_test.fit(train_features, train_target)

    cv_score_test = np.mean(cross_val_score(classifier_test, test_features, test_target, cv=3, scoring='roc_auc'))
    scores_test.append(cv_score_test)
    print('Test CV score for class {} is {}'.format(class_name, cv_score_test))

    #classifier.fit(train_features, train_target)
    #submission[class_name] = classifier.predict_proba(test_features)[:, 1]

#print('Total CV score is {}'.format(np.mean(scores)))
#print('Total Test CV score is {}'.format(np.mean(scores_test)))

#submission.to_csv('submission.csv', index=False)

Test CV score for class toxic is 0.9567386963649188


In [1]:

import numpy as np
import pandas as pd
from os import chdir, path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from scipy.sparse import hstack

class_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
chdir(path.dirname(path.abspath('__file__')))
train = pd.read_csv('data/train.csv').fillna(' ')
test = pd.read_csv('data/test.csv').fillna(' ')

df_test = pd.merge(pd.read_csv("data/test.csv"),
                   pd.read_csv("data/test_labels.csv"),
                   how="inner",
                   on="id")

#y_test = df_test[df_test["toxic"]>-1].loc[:,"toxic"]
train_text = train['comment_text']
test_text = pd.DataFrame(df_test[df_test["identity_hate"]>-1].loc[:,"comment_text"])["comment_text"]
all_text = pd.concat([train_text, test_text])

word_vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents='unicode',
    analyzer='word',
    token_pattern=r'\w{1,}',
    stop_words='english',
    ngram_range=(1, 1),
    max_features=10000)
word_vectorizer.fit(all_text)
train_word_features = word_vectorizer.transform(train_text)
test_word_features = word_vectorizer.transform(test_text)

char_vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents='unicode',
    analyzer='char',
    stop_words='english',
    ngram_range=(2, 6),
    max_features=50000)
char_vectorizer.fit(all_text)
train_char_features = char_vectorizer.transform(train_text)
test_char_features = char_vectorizer.transform(test_text)

train_features = hstack([train_char_features, train_word_features])
test_features = hstack([test_char_features, test_word_features])

scores = []
scores_test = []
submission = pd.DataFrame.from_dict({'id': test['id']})
for class_name in ["identity_hate"]:

    train_target = train[class_name]
    test_target = df_test[df_test[class_name]>-1].loc[:,"identity_hate"]

    classifier_test = LogisticRegression(C=0.1, solver='sag')
    classifier_test.fit(train_features, train_target)

    cv_score_test = np.mean(cross_val_score(classifier_test, test_features, test_target, cv=3, scoring='roc_auc'))
    scores_test.append(cv_score_test)
    print('Test CV score for class {} is {}'.format(class_name, cv_score_test))

    #classifier.fit(train_features, train_target)
    #submission[class_name] = classifier.predict_proba(test_features)[:, 1]

#print('Total CV score is {}'.format(np.mean(scores)))
#print('Total Test CV score is {}'.format(np.mean(scores_test)))

#submission.to_csv('submission.csv', index=False)


Test CV score for class identity_hate is 0.9748963999977726


# Feedback 

fastText by default builds its word embeddings from the dataset it is trained on  

for translation, use https://www.linguee.de/; its makers also built the DeepL Translator  

regarding preprocessing: 
- fastText removes punctuation (and thus variance), which is why it is important to e.g. replace smileys by words to keep this variance

parameter tuning is good, but further research in these directions is more important:
- get complementary data (like the back-and-forth translations)
- create own word/character embeddings on additional corpora (search the web for open data sources)
- create a silver standard: computational labelling of unlabeled data 
- feature engineering (less important)

it's important to create extra signals by generating or adding additional data 
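
The silver-standard idea could look roughly like this: score unlabeled comments with an existing model and keep only the confident predictions as additional training labels. The sketch is an assumption about how this might be done, not part of the feedback; the function name, the thresholds, and the example data are hypothetical.

```python
def silver_labels(texts, toxic_proba, low=0.05, high=0.95):
    """Keep only confidently scored comments and label them automatically.
    The thresholds are hypothetical and would need tuning."""
    labeled = []
    for text, p in zip(texts, toxic_proba):
        if p >= high:
            labeled.append((text, 1))
        elif p <= low:
            labeled.append((text, 0))
        # comments in between are too uncertain and are skipped
    return labeled

# Hypothetical unlabeled comments with probabilities from an existing model.
texts = ["comment a", "comment b", "comment c"]
toxic_proba = [0.99, 0.50, 0.01]
print(silver_labels(texts, toxic_proba))
```

Such labels inherit the errors of the model that produced them, so strict thresholds and a spot check of the kept comments are advisable.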