# Model for the Toxic Comment Competition on Kaggle using FastText

https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge

Based on Jeremy Howard's kernel published on Kaggle: https://www.kaggle.com/jhoward/improved-lstm-baseline-glove-dropout

Instead of the GloVe embedding, this notebook uses the fastText embedding (crawl_300d-2M.vec). I preprocessed the embedding file with embedding_utils.py to check the dimensions and remove the first line.

Further, it leverages an idea from Pavel Ostyakov, discussed in the Kaggle forum, to augment the data set using machine translation of the provided data. I augmented only the training set, with data obtained by translating each comment from English into DE, ES and FR and then back to English.
This additional data was added to the training set, leading to about 650k labeled comments. I split off the validation set from the original data.
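
The round trip described above can be sketched as follows. This is only an illustration: `back_translate` and `fake_translate` are hypothetical names, and the real translation service is not part of this notebook, so a stand-in function is used in its place. The labels of each comment are assumed to carry over unchanged to its back-translated versions.

```python
def back_translate(comments, translate, pivots=("de", "es", "fr")):
    """Return one round-trip paraphrase per (pivot language, comment) pair.

    `translate(text, src, dst)` is a user-supplied callable (hypothetical here).
    The output is ordered language by language, so the label rows can be
    repeated once per pivot language.
    """
    augmented = []
    for lang in pivots:
        for text in comments:
            pivot_text = translate(text, "en", lang)            # English -> pivot
            augmented.append(translate(pivot_text, lang, "en"))  # pivot -> English
    return augmented


def fake_translate(text, src, dst):
    """Stand-in translator for demonstration only: it just tags the text."""
    return f"[{src}->{dst}] {text}"


demo = back_translate(["hello world"], fake_translate)
print(len(demo))   # one paraphrase per pivot language
print(demo[0])
```

With a real translator, the paraphrases differ slightly in wording from the original comments, which is what makes them useful as additional training examples.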

With that approach the validation accuracy improved to 98.970%, an improvement of about 0.5% compared to the version using only the fastText embedding.


In [109]:
import sys, os, re, csv, codecs, logging, numpy as np, pandas as pd

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation, BatchNormalization
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.models import Model
from keras import initializers, regularizers, constraints, optimizers, layers
from keras.callbacks import ModelCheckpoint, EarlyStopping, Callback
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

In [110]:
path = 'data/'
comp = 'jigsaw-toxic-comment-classification-challenge/'
EMBEDDING_FILE=f'{path}crawl_300d-2M_processed.vec'
#EMBEDDING_FILE=f'{path}glove.6B.50d.txt'
TRAIN_DATA_FILE=f'{path}{comp}train.csv'
TEST_DATA_FILE=f'{path}{comp}test.csv'
TRAIN_DATA_AUGMENTED=[f'{path}train_de.csv', f'{path}train_fr.csv', f'{path}train_es.csv']

Set some basic config parameters:

In [111]:
embed_size=300
max_features = 60000 # how many unique words to use (i.e num rows in embedding vector)
maxlen = 400 # max number of words in a comment to use

Read in our data and replace missing values:

In [112]:
def fillna(dataframe):
    df = dataframe
    return df["comment_text"].fillna("_na_").values

In [113]:
train = pd.read_csv(TRAIN_DATA_FILE)
test = pd.read_csv(TEST_DATA_FILE)

list_sentences_train = fillna(train)
list_sentences_test = fillna(test)
list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


train_y = train[list_classes].values


Augment the training data with datasets derived by translating the comments from English to an intermediate language (German, French, Spanish) and back to English. This approach augments the training set significantly.

In [166]:
df1 = pd.read_csv(f'{path}train_de.csv')
df2 = pd.read_csv(f'{path}train_es.csv')
df3 = pd.read_csv(f'{path}train_fr.csv')

a_y1 = df1[list_classes].values
a_y2 = df2[list_classes].values
a_y3 = df3[list_classes].values

# stack the labels in the same order as the comments below: DE, ES, FR
list_a_y = np.append(a_y1, values = a_y2, axis=0)
list_a_y = np.append(list_a_y, values = a_y3, axis=0)

list_a_train = np.append(fillna(df1),values=fillna(df2), axis=0)
list_a_train = np.append(list_a_train,values=fillna(df3), axis=0)


In [167]:
print(list_a_train.shape)

(478713,)


Standard keras preprocessing, to turn each comment into a list of word indexes of equal length (with truncation or padding as needed).

In [186]:
tokenizer = Tokenizer(num_words=max_features)

# we are fitting the tokenizer on the full set of data including the augmented data
all_train = np.append(list_sentences_train,list_a_train)
tokenizer.fit_on_texts(all_train)
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
list_tokenized_test = tokenizer.texts_to_sequences(list_sentences_test)
list_tokenized_train_a = tokenizer.texts_to_sequences(list_a_train)                     
X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)
X_t_a = pad_sequences(list_tokenized_train_a, maxlen=maxlen)
X_te = pad_sequences(list_tokenized_test, maxlen=maxlen)
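
To make the padding step above concrete, here is a minimal NumPy sketch of what `pad_sequences` does with its default settings (to my knowledge `padding='pre'` and `truncating='pre'`, i.e. zeros are added on the left and long comments keep their last `maxlen` tokens). `pad_pre` is a hypothetical helper written for illustration, not part of Keras.

```python
import numpy as np


def pad_pre(sequences, maxlen):
    """Left-pad with zeros and keep only the last `maxlen` tokens of each sequence."""
    out = np.zeros((len(sequences), maxlen), dtype="int32")
    for row, seq in enumerate(sequences):
        if not seq:
            continue                      # an empty comment stays all zeros
        kept = seq[-maxlen:]              # truncate from the front
        out[row, -len(kept):] = kept      # right-align, zeros on the left
    return out


padded = pad_pre([[5, 3], [7, 2, 9, 4, 1]], maxlen=4)
print(padded)
```

The short sequence is padded to `[0, 0, 5, 3]` and the long one is cut to `[2, 9, 4, 1]`, so every row has the same length and can be fed to the `Input(shape=(maxlen,))` layer.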

In [169]:
all_train_y = np.append(train_y, list_a_y, axis=0)  # axis=0 keeps the (n_samples, 6) shape

Read the embeddings vectors (space delimited strings) into a dictionary from word->vector.

In [170]:
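def read_embedding_list(file_path):
    """Parse a space-delimited .vec file into an array of vectors and a word->vector dict."""
    embedding_word_dict = {}
    embedding_list = []
    with open(file_path, encoding="utf-8") as f:
        for row in f:
            data = row.split(" ")
            word = data[0]
            # data[-1] is dropped: this assumes every line ends with a trailing
            # space, so the last field holds only the newline
            embedding = np.array([float(num) for num in data[1:-1]])
            embedding_list.append(embedding)
            embedding_word_dict[word] = embedding

    embedding_list = np.array(embedding_list)
    return embedding_list, embedding_word_dict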

Use these vectors to create our embedding matrix, with random initialization for words that aren't in fastText. We'll use the same mean and stdev as the fastText embeddings when generating the random init.

In [171]:
embedding_list, embeddings_index = read_embedding_list(EMBEDDING_FILE)

Determine the mean and standard deviation used to initialize the embedding matrix. This ensures that all words not found in the embeddings are set to random values around the mean.

In [172]:
all_embs = np.stack(list(embeddings_index.values()))  # np.stack needs a sequence, not a dict view
emb_mean,emb_std = all_embs.mean(), all_embs.std()
emb_mean,emb_std

(0.005528633668833318, 0.34703942181861597)

In [173]:
word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index))
# start from random vectors so that words missing from fastText still get a plausible init
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
for word, i in word_index.items():
    if i >= nb_words: continue  # skip indices that have no row in the matrix
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None: embedding_matrix[i] = embedding_vector
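
The same construction on a toy vocabulary, as a minimal sketch: all names and values here (`toy_vectors`, `toy_word_index`, the 3-dimensional vectors) are made up for illustration and stand in for the fastText dictionary and the tokenizer's `word_index`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pretrained vectors (stand-in for embeddings_index)
toy_vectors = {
    "good": np.array([0.1, 0.2, 0.3]),
    "bad": np.array([-0.1, -0.2, -0.3]),
}
# Hypothetical tokenizer index; Keras word indices start at 1
toy_word_index = {"good": 1, "unknownword": 2, "bad": 3}
toy_max_features = 4
dim = 3

# Mean and standard deviation over all pretrained values
stacked = np.stack(list(toy_vectors.values()))
mean, std = stacked.mean(), stacked.std()

# Random init first, then overwrite the rows of known words
matrix = rng.normal(mean, std, (toy_max_features, dim))
for word, i in toy_word_index.items():
    if i >= toy_max_features:
        continue
    vec = toy_vectors.get(word)
    if vec is not None:
        matrix[i] = vec

print(matrix)
```

Rows 1 and 3 hold the pretrained vectors, while row 0 (the padding index) and row 2 (the unknown word) keep their random values.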

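The model is a simple bidirectional LSTM followed by two fully connected layers. We add some dropout to the LSTM since even 2 epochs is enough to overfit. First, a callback that reports the ROC-AUC on the validation data after each epoch: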

In [174]:
class RocAucEvaluation(Callback):
    def __init__(self, validation_data=(), interval=1):
        # super(Callback, self) would skip Callback.__init__ itself
        super(RocAucEvaluation, self).__init__()

        self.interval = interval
        self.X_val, self.y_val = validation_data

    def on_epoch_end(self, epoch, logs=None):
        if epoch % self.interval == 0:
            y_pred = self.model.predict(self.X_val, verbose=0)
            score = roc_auc_score(self.y_val, y_pred)
            print("\n ROC-AUC - epoch: %d - score: %.6f \n" % (epoch+1, score))
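
For intuition about the score printed by this callback: with six label columns, `roc_auc_score` with its default macro averaging should amount to computing the AUC per column and taking the mean. A minimal NumPy sketch of that idea follows; `column_auc` and `mean_column_auc` are hypothetical helpers, and the pairwise comparison used here is only practical for small arrays.

```python
import numpy as np


def column_auc(y_true, y_score):
    """AUC for one label: fraction of (positive, negative) pairs ranked correctly."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    # ties count as half a correct pair
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))


def mean_column_auc(y_true, y_score):
    """Average the per-label AUC over all label columns."""
    n_labels = y_true.shape[1]
    return float(np.mean([column_auc(y_true[:, j], y_score[:, j]) for j in range(n_labels)]))


y_true = np.array([[1, 0], [0, 1], [1, 0], [0, 0]])
y_score = np.array([[0.9, 0.2], [0.1, 0.8], [0.4, 0.3], [0.6, 0.1]])
print(mean_column_auc(y_true, y_score))  # 0.875
```

The first column ranks 3 of 4 positive/negative pairs correctly (0.75) and the second ranks all of them correctly (1.0), giving 0.875.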



In [175]:
inp = Input(shape=(maxlen,))
x = Embedding(max_features, embed_size, weights=[embedding_matrix],trainable = False)(inp)
x = Bidirectional(LSTM(128, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(x)
x = GlobalMaxPool1D()(x)
x = Dropout(0.5)(x)
x = Dense(50, activation="relu")(x)
x = Dropout(0.4)(x)
x = Dense(6, activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
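
The `GlobalMaxPool1D` layer in the model above collapses the sequence of LSTM outputs into a single vector per comment by taking, for each feature, the maximum over all time steps. A tiny NumPy illustration with made-up numbers (one comment, three time steps, two features instead of the 256 the bidirectional LSTM produces):

```python
import numpy as np

# shape (batch, timesteps, features)
lstm_out = np.array([[[0.1, -0.5],
                      [0.7,  0.2],
                      [0.3,  0.9]]])

pooled = lstm_out.max(axis=1)   # max over the time axis
print(pooled.shape)             # (1, 2)
print(pooled)                   # [[0.7 0.9]]
```

The result no longer depends on where in the comment the strongest activation occurred, only on its value.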

Now we're ready to fit our model! Use `validation_split` when not submitting.

In [215]:
# We split the original training data first and only then append the augmented
# data to the training portion, so the validation set contains original comments only.
X_tra, X_val, y_tra, y_val = train_test_split(X_t, train_y, train_size=0.90, random_state=275)
print("Shape of X_tra   : " + str(X_tra.shape))
print("Shape of X_t_a   : " + str(X_t_a.shape))
X_tra = np.append(X_tra, X_t_a, axis=0)

print("Shape of X_tra   : " + str(X_tra.shape))
y_tra = np.append(y_tra, list_a_y, axis=0)
print("Shape of y_tra   : " + str(y_tra.shape))
RocAuc = RocAucEvaluation(validation_data=(X_val, y_val), interval=1)



Shape of X_tra   : (143613, 400)
Shape of X_t_a   : (478713, 400)
Shape of X_tra   : (622326, 400)
Shape of y_tra   : (622326, 6)
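
One thing the split above does not guard against: `X_t_a` contains back-translations of every original comment, including those that end up in `X_val`, so the validation scores may be somewhat optimistic. Below is a minimal sketch of one possible alternative, which this notebook does not use. It assumes that each augmented file is row-aligned with train.csv (row k of every language file is the back-translation of row k of the original), which the row counts suggest but which I have not verified. The toy arrays stand in for `X_t` and `X_t_a`.

```python
import numpy as np

n_orig, n_langs = 10, 3

# Toy stand-ins: each original "comment" is its row id, and its
# back-translations are id + 100, stacked language by language
X_orig = np.arange(n_orig).reshape(-1, 1)
X_aug = np.tile(X_orig, (n_langs, 1)) + 100

# Split the original row indices: 10% validation, 90% training
rng = np.random.default_rng(275)
perm = rng.permutation(n_orig)
val_idx, tra_idx = perm[:1], perm[1:]

# Keep only augmented rows derived from training comments
aug_idx = np.concatenate([tra_idx + k * n_orig for k in range(n_langs)])

X_tra_demo = np.concatenate([X_orig[tra_idx], X_aug[aug_idx]], axis=0)
X_val_demo = X_orig[val_idx]

print(X_tra_demo.shape, X_val_demo.shape)
```

The label arrays would be indexed with the same `tra_idx`, `val_idx` and `aug_idx` so that comments and labels stay aligned.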


In [217]:
filepath="crawl_augmented_weights.best.model" + ".hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
early_stop = EarlyStopping(monitor='val_acc', patience=2, mode='max') 
callbacks_list = [checkpoint, early_stop, RocAuc]

hist = model.fit(X_tra, y_tra, batch_size=64, epochs=20, validation_data=(X_val, y_val),callbacks=callbacks_list)


Train on 622326 samples, validate on 15958 samples
Epoch 1/20

Epoch 00001: val_acc improved from -inf to 0.98405, saving model to crawl_augmented_weights.best.model.hdf5

 ROC-AUC - epoch: 1 - score: 0.988710 

Epoch 2/20

Epoch 00002: val_acc improved from 0.98405 to 0.98527, saving model to crawl_augmented_weights.best.model.hdf5

 ROC-AUC - epoch: 2 - score: 0.991358 

Epoch 3/20

Epoch 00003: val_acc improved from 0.98527 to 0.98572, saving model to crawl_augmented_weights.best.model.hdf5

 ROC-AUC - epoch: 3 - score: 0.992253 

Epoch 4/20

Epoch 00004: val_acc improved from 0.98572 to 0.98633, saving model to crawl_augmented_weights.best.model.hdf5

 ROC-AUC - epoch: 4 - score: 0.993277 

Epoch 5/20

Epoch 00005: val_acc improved from 0.98633 to 0.98710, saving model to crawl_augmented_weights.best.model.hdf5

 ROC-AUC - epoch: 5 - score: 0.993645 

Epoch 6/20

Epoch 00006: val_acc improved from 0.98710 to 0.98797, saving model to crawl_augmented_weights.best.model.hdf5

 ROC-AUC

In [218]:
y_test = model.predict([X_te], batch_size=2048, verbose=1)
sample_submission = pd.read_csv(f'{path}{comp}sample_submission.csv')
sample_submission[list_classes] = y_test
sample_submission.to_csv('submission.csv', index=False)



## Checking prediction with a sample from test set

In [219]:
test_sample = list_sentences_test[0:2].reshape(2,)
print(test_sample[0])
print("----------------------------------")
print(test_sample[1])

Yo bitch Ja Rule is more succesful then you'll ever be whats up with you and hating you sad mofuckas...i should bitch slap ur pethedic white faces and get you to kiss my ass you guys sicken me. Ja rule is about pride in da music man. dont diss that shit on him. and nothin is wrong bein like tupac he was a brother too...fuckin white boys get things right next time.,
----------------------------------
== From RfC == 

 The title is fine as it is, IMO.


In [220]:
tokenized_sample = tokenizer.texts_to_sequences(test_sample)
X_sample = pad_sequences(tokenized_sample, maxlen=maxlen)

y_test = model.predict([X_sample],verbose=0)


In [221]:
print(y_test)

[[9.9991131e-01 3.5480884e-01 9.9813110e-01 1.5038978e-03 9.7947401e-01
  6.3801211e-01]
 [4.7656525e-07 1.8563962e-15 5.0761123e-08 2.9353562e-21 9.4404473e-10
  2.9448616e-17]]


In [222]:
heading = np.append(["comment"], [list_classes]).reshape(1,7)
dd = np.append(test_sample.reshape(2,1), y_test, axis=1)

index = ['Row'+str(i) for i in range(1, len(dd)+1)]
df = pd.DataFrame(dd, index, columns=heading[0,0:])
df.tail()


| | comment | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|
| Row1 | Yo bitch Ja Rule is more succesful then you'll... | 0.999911 | 0.354809 | 0.998131 | 0.0015039 | 0.979474 | 0.638012 |
| Row2 | == From RfC == \n\n The title is fine as it is... | 4.76565e-07 | 1.8564e-15 | 5.07611e-08 | 2.93536e-21 | 9.44045e-10 | 2.94486e-17 |


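Row1 has a high predicted probability of being toxic, obscene and insulting (all above 0.97); identity_hate is also elevated at about 0.64.

Row2 can be considered a clean comment: all six probabilities are close to zero.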
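
This reading of the two rows can be reproduced by thresholding the predicted probabilities. A minimal sketch, using the values printed above; the cut-off of 0.5 and the helper `labels_above` are assumptions made for illustration and are not part of the competition setup, which scores the raw probabilities.

```python
import numpy as np

list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Predictions copied from the output of the previous cells
probs = np.array([
    [9.9991131e-01, 3.5480884e-01, 9.9813110e-01, 1.5038978e-03, 9.7947401e-01, 6.3801211e-01],
    [4.7656525e-07, 1.8563962e-15, 5.0761123e-08, 2.9353562e-21, 9.4404473e-10, 2.9448616e-17],
])


def labels_above(probs, threshold=0.5):
    """Return, per row, the class names whose probability reaches the threshold."""
    return [[name for name, p in zip(list_classes, row) if p >= threshold] for row in probs]


for i, flagged in enumerate(labels_above(probs), start=1):
    print(f"Row{i}:", flagged if flagged else "clean")
```

With this cut-off Row1 is flagged as toxic, obscene, insult and identity_hate, and Row2 comes out clean.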