# Baseline Model for the Toxic Comment Competition on Kaggle

https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge

Based on the kernel by Jeremy Howard published on Kaggle: https://www.kaggle.com/jhoward/improved-lstm-baseline-glove-dropout

Accuracy on validation set: 98.22%

In [1]:
import sys, os, re, csv, codecs, logging, numpy as np, pandas as pd

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation, BatchNormalization
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.models import Model
from keras import initializers, regularizers, constraints, optimizers, layers
from keras.callbacks import ModelCheckpoint, EarlyStopping, Callback
from sklearn.metrics import roc_auc_score

Using TensorFlow backend.


We include the GloVe word vectors in our input files. To include these in your kernel, simply click 'input files' at the top of the notebook and search for 'glove' in the 'datasets' section.

In [5]:
path = 'data/'
comp = 'jigsaw-toxic-comment-classification-challenge/'
EMBEDDING_FILE=f'{path}glove.6B.50d.txt'
TRAIN_DATA_FILE=f'{path}{comp}train.csv'
TEST_DATA_FILE=f'{path}{comp}test.csv'

Set some basic config parameters:

In [6]:
embed_size = 50 # how big is each word vector
max_features = 60000 # how many unique words to use (i.e. number of rows in the embedding matrix)
maxlen = 400 # max number of words in a comment to use

Read in our data and replace missing values:

In [8]:
train = pd.read_csv(TRAIN_DATA_FILE)
test = pd.read_csv(TEST_DATA_FILE)

list_sentences_train = train["comment_text"].fillna("_na_").values
list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
y = train[list_classes].values
list_sentences_test = test["comment_text"].fillna("_na_").values

Standard Keras preprocessing turns each comment into a list of word indexes of equal length (truncating or padding as needed).

In [22]:
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(list_sentences_train))
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
list_tokenized_test = tokenizer.texts_to_sequences(list_sentences_test)
X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)
X_te = pad_sequences(list_tokenized_test, maxlen=maxlen)
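
To make "truncation or padding" concrete, here is a small NumPy sketch of what `pad_sequences` does with its default settings (`padding='pre'`, `truncating='pre'`): short sequences are left-padded with zeros and long ones keep only their last `maxlen` tokens. The helper `pad_pre` and the toy token lists are illustrative names, not part of the notebook.

```python
import numpy as np

def pad_pre(seqs, maxlen, value=0):
    """Left-pad with `value`; keep only the last `maxlen` tokens of longer sequences."""
    out = np.full((len(seqs), maxlen), value, dtype="int32")
    for row, seq in enumerate(seqs):
        if not seq:
            continue  # an empty comment stays all padding
        trunc = seq[-maxlen:]           # drop tokens from the front if too long
        out[row, -len(trunc):] = trunc  # right-align the remaining tokens
    return out

toy = [[5, 8, 2], [7], [1, 2, 3, 4, 5, 6]]  # hypothetical word indexes
padded = pad_pre(toy, maxlen=4)
print(padded)
```

With `maxlen = 400` as configured above, a comment longer than 400 tokens would lose its beginning rather than its end under these defaults.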

Read the GloVe word vectors (space-delimited strings) into a dictionary mapping word -> vector.

In [11]:
def get_coefs(word, *arr): return word, np.asarray(arr, dtype='float32')
with open(EMBEDDING_FILE, encoding='utf8') as f:  # close the file and avoid platform-dependent decoding
    embeddings_index = dict(get_coefs(*o.strip().split()) for o in f)
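
As a quick illustration of the file layout this parser expects, the sketch below runs the same `get_coefs` helper over two made-up lines held in memory. The words, the values, and the 3-dimensional vectors are invented for the example; the real `glove.6B.50d.txt` has 50 values per word.

```python
import io
import numpy as np

def get_coefs(word, *arr):
    # First token is the word, the rest are its vector components.
    return word, np.asarray(arr, dtype="float32")

# Two fabricated lines in the "word v1 v2 ... vn" layout.
fake_glove = io.StringIO("the 0.1 -0.2 0.3\ncat 0.4 0.5 -0.6\n")
toy_index = dict(get_coefs(*line.strip().split()) for line in fake_glove)

print(toy_index["cat"], toy_index["cat"].dtype)
```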

Use these vectors to create our embedding matrix, with random initialization for words that aren't in GloVe. The random values are drawn from a normal distribution with the same mean and standard deviation as the GloVe embeddings.

In [12]:
all_embs = np.stack(list(embeddings_index.values()))  # np.stack needs a sequence, not a dict view
emb_mean,emb_std = all_embs.mean(), all_embs.std()
emb_mean,emb_std

(0.020940498, 0.6441043)

In [13]:
word_index = tokenizer.word_index
# Keras word indexes start at 1 (0 is reserved for padding), hence the +1
nb_words = min(max_features, len(word_index) + 1)
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
for word, i in word_index.items():
    if i >= nb_words: continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None: embedding_matrix[i] = embedding_vector
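
The same construction on a toy vocabulary may make the indexing easier to follow. In this sketch the three words, their 3-dimensional vectors, and all `toy_*` names are invented; the matrix gets one extra row because Keras-style word indexes start at 1 and row 0 is left for padding.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim = 3

# Hypothetical pretrained vectors; "zzz" is deliberately missing.
toy_vectors = {
    "good": np.array([1.0, 0.0, 0.0], dtype="float32"),
    "bad": np.array([0.0, 1.0, 0.0], dtype="float32"),
}
toy_word_index = {"good": 1, "zzz": 2, "bad": 3}  # 1-based, like Tokenizer.word_index

# Match the random init to the statistics of the known vectors.
all_embs = np.stack(list(toy_vectors.values()))
mean, std = all_embs.mean(), all_embs.std()

n_rows = len(toy_word_index) + 1  # +1 for the padding row at index 0
toy_matrix = rng.normal(mean, std, (n_rows, embed_dim))
for word, i in toy_word_index.items():
    vec = toy_vectors.get(word)
    if vec is not None:
        toy_matrix[i] = vec  # known word: overwrite the random row

print(toy_matrix)
```

Rows 1 and 3 end up holding the pretrained vectors, while rows 0 and 2 keep their random initialization.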

Simple bidirectional LSTM with two fully connected layers. We add some dropout to the LSTM since even 2 epochs is enough to overfit.

In [14]:
inp = Input(shape=(maxlen,))
x = Embedding(nb_words, embed_size, weights=[embedding_matrix], trainable=False)(inp)  # input_dim must match the matrix rows
x = Bidirectional(LSTM(128, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(x)
x = GlobalMaxPool1D()(x)
x = Dropout(0.5)(x)
x = Dense(50, activation="relu")(x)
x = Dropout(0.4)(x)
x = Dense(6, activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
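
Because the output layer has six independent sigmoid units, `binary_crossentropy` and `accuracy` are both averaged over every (comment, label) pair. The NumPy sketch below illustrates why that accuracy can look high on sparse labels: a predictor that always outputs 0 scores well whenever positives are rare. The 4% positive rate and the random labels are assumptions made for the illustration, not statistics measured on the competition data, and the two helper functions only approximate the Keras metrics.

```python
import numpy as np

# Toy label matrix: 1000 comments x 6 labels, roughly 4% positives (assumed rate).
rng = np.random.default_rng(0)
y_true = (rng.random((1000, 6)) < 0.04).astype("float32")
y_zero = np.zeros_like(y_true)  # "predict clean for everything"

def binary_accuracy(y_true, y_prob, threshold=0.5):
    # Fraction of (comment, label) cells where the thresholded prediction matches.
    return float(np.mean((y_prob > threshold) == (y_true > 0.5)))

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    # Mean log loss over all cells, with clipping to avoid log(0).
    p = np.clip(y_prob, eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

acc = binary_accuracy(y_true, y_zero)
loss = binary_crossentropy(y_true, np.full_like(y_true, 0.04))
print(f"all-zeros accuracy: {acc:.4f}, constant-0.04 loss: {loss:.4f}")
```

The competition itself was scored on mean column-wise ROC AUC, which `roc_auc_score` (imported above but not otherwise used) can compute per label on a held-out set.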

Now we're ready to fit our model! Use `validation_split` when not submitting.

In [15]:
filepath="weights.best.model" + ".hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
early_stop = EarlyStopping(monitor='val_acc', patience=2, mode='max') 
callbacks_list = [checkpoint, early_stop]

model.fit(X_t, y, batch_size=64, epochs=20, validation_split=0.1,callbacks=callbacks_list);

Train on 143613 samples, validate on 15958 samples
Epoch 1/20

Epoch 00001: val_acc improved from -inf to 0.97871, saving model to weights.best.model.hdf5
Epoch 2/20

Epoch 00002: val_acc improved from 0.97871 to 0.98002, saving model to weights.best.model.hdf5
Epoch 3/20

Epoch 00003: val_acc improved from 0.98002 to 0.98067, saving model to weights.best.model.hdf5
Epoch 4/20

Epoch 00004: val_acc improved from 0.98067 to 0.98149, saving model to weights.best.model.hdf5
Epoch 5/20

Epoch 00005: val_acc improved from 0.98149 to 0.98151, saving model to weights.best.model.hdf5
Epoch 6/20

Epoch 00006: val_acc improved from 0.98151 to 0.98166, saving model to weights.best.model.hdf5
Epoch 7/20

Epoch 00007: val_acc improved from 0.98166 to 0.98210, saving model to weights.best.model.hdf5
Epoch 8/20

Epoch 00008: val_acc improved from 0.98210 to 0.98233, saving model to weights.best.model.hdf5
Epoch 9/20

Epoch 00009: val_acc improved from 0.98233 to 0.98245, saving model to weights.best.
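
The train/validate sizes at the top of the log follow from how `validation_split` works: Keras holds out the last fraction of the arrays, before any shuffling. A minimal sketch of that arithmetic, assuming the split point is computed as `int(n * (1 - validation_split))` (the helper `split_sizes` is an illustrative name):

```python
def split_sizes(n_samples, validation_split):
    """Sizes of the (train, validation) parts when the last fraction is held out."""
    split_at = int(n_samples * (1.0 - validation_split))
    return split_at, n_samples - split_at

# 143613 + 15958 = 159571 rows, the totals reported in the training log.
n_train, n_val = split_sizes(159571, 0.1)
print(n_train, n_val)
```

Since the held-out rows are simply the tail of `train.csv`, the validation score depends on the file's row order.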

In [16]:
y_test = model.predict([X_te], batch_size=1024, verbose=1)
sample_submission = pd.read_csv(f'{path}{comp}sample_submission.csv')
sample_submission[list_classes] = y_test
sample_submission.to_csv('submission.csv', index=False)



## Checking predictions on a sample from the test set

In [24]:
model.load_weights("weights.best.model.hdf5")

In [36]:
test_sample = list_sentences_test[0:3]  # first three test comments (already a 1-D array)
print(test_sample[0])
print("----------------------------------")
print(test_sample[1])
print("----------------------------------")
print(test_sample[2])

Yo bitch Ja Rule is more succesful then you'll ever be whats up with you and hating you sad mofuckas...i should bitch slap ur pethedic white faces and get you to kiss my ass you guys sicken me. Ja rule is about pride in da music man. dont diss that shit on him. and nothin is wrong bein like tupac he was a brother too...fuckin white boys get things right next time.,
----------------------------------
== From RfC == 

 The title is fine as it is, IMO.
----------------------------------
" 

 == Sources == 

 * Zawe Ashton on Lapland —  /  "


In [37]:
tokenized_sample = tokenizer.texts_to_sequences(test_sample)
X_sample = pad_sequences(tokenized_sample, maxlen=maxlen)

y_test = model.predict([X_sample],verbose=0)


In [39]:
heading = np.append(["comment"], [list_classes]).reshape(1,7)
dd = np.append(test_sample.reshape(3,1), y_test, axis=1)

index = ['Row'+str(i) for i in range(1, len(dd)+1)]
df = pd.DataFrame(dd, index, columns=heading[0,0:])
df.tail()


| | comment | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|
| Row1 | Yo bitch Ja Rule is more succesful then you'll... | 0.999616 | 0.400736 | 0.964734 | 0.10676 | 0.886818 | 0.394428 |
| Row2 | == From RfC == \n\n The title is fine as it is... | 0.000626001 | 1.39597e-08 | 0.00015868 | 1.95521e-07 | 3.66872e-05 | 9.68649e-06 |
| Row3 | " \n\n == Sources == \n\n * Zawe Ashton on Lap... | 0.000235214 | 3.05182e-09 | 3.70316e-05 | 1.34545e-07 | 9.7508e-06 | 2.96946e-06 |


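Row1 has a high predicted probability of being toxic (0.9996), obscene (0.9647), and insulting (0.8868).

Row2 can be considered a clean comment: every predicted probability is below 0.001.

Row3 can be considered a clean comment: every predicted probability is below 0.001.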
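
This reading of the three rows can be reproduced mechanically by thresholding the probabilities. The sketch below copies the values from the output above; the 0.5 cut-off and the helper `flagged_labels` are assumptions made for the example, since the notebook itself never turns probabilities into hard labels.

```python
import numpy as np

list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Predicted probabilities for Row1..Row3, copied from the output above.
probs = np.array([
    [0.999616, 0.400736, 0.964734, 0.10676, 0.886818, 0.394428],
    [0.000626001, 1.39597e-08, 0.00015868, 1.95521e-07, 3.66872e-05, 9.68649e-06],
    [0.000235214, 3.05182e-09, 3.70316e-05, 1.34545e-07, 9.7508e-06, 2.96946e-06],
])

threshold = 0.5  # assumed cut-off, not tuned on any data

def flagged_labels(row, threshold):
    """Names of the classes whose probability reaches the threshold."""
    return [name for name, p in zip(list_classes, row) if p >= threshold]

labels = [flagged_labels(row, threshold) for row in probs]
for i, row_labels in enumerate(labels, start=1):
    print(f"Row{i}:", row_labels or "clean")
```

A different threshold per class may be more appropriate in practice, since rare classes such as `threat` tend to receive lower probabilities overall.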