### Kaggle - Toxic Comment Classification Challenge (2018)

The goal is to build a multi-label text classifier that tags Wikipedia comments with any of six labels: `toxic`, `severe_toxic`, `obscene`, `threat`, `insult` and `identity_hate`. A single comment can carry several labels at once, or none.

The training set consists of roughly 159,000 comments extracted from Wikipedia talk page edits.

A bidirectional LSTM (BiLSTM) classifier was trained on this data and reached an accuracy of 98.26%.
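
An accuracy figure on this task needs some care, because most comments carry none of the six labels. The sketch below assumes the reported figure is element-wise binary accuracy (each of the six label cells counted separately, thresholded at 0.5), which is what Keras typically reports for sigmoid outputs trained with `binary_crossentropy`. The helper `binary_accuracy` and the toy label matrix are made up for illustration and are not taken from the competition data.

```python
import numpy as np


def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of individual label cells predicted correctly (hypothetical helper)."""
    y_pred = (y_prob > threshold).astype(int)
    return float(np.mean(y_pred == y_true))


# Made-up data: 100 comments, 6 labels, only 12 positive cells in total.
y_true = np.zeros((100, 6), dtype=int)
y_true[:10, 0] = 1  # ten comments flagged with the first label
y_true[:2, 4] = 1   # two of them also flagged with the fifth label

# A "model" that predicts probability 0 for every label of every comment.
all_clean = np.zeros((100, 6))

acc = binary_accuracy(y_true, all_clean)
print(acc)
```

In this toy case 588 of the 600 label cells are zeros, so a model that never flags anything already scores 0.98. For that reason it may be worth tracking a ranking metric as well; the competition itself scored submissions by mean column-wise ROC AUC.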

In [None]:
import numpy as np 
import pandas as pd


In [None]:
from keras.models import Model
from keras.layers import Dense, Embedding, Input
from keras.layers import LSTM, Bidirectional, GlobalMaxPool1D, Dropout
from keras.preprocessing import text, sequence
from keras.callbacks import EarlyStopping, ModelCheckpoint

max_features = 20000
maxlen = 100


### Load Data

Load and extract the train and test datasets

In [None]:
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")


In [None]:
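# Replace missing comments with a placeholder token so the tokenizer always receives a string.
list_sentences_train = train["comment_text"].fillna("CVxTz").values
list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
# Label matrix with one column per class; a comment can have several labels set.
y = train[list_classes].values
list_sentences_test = test["comment_text"].fillna("CVxTz").values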


### Preprocessing the data

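(1) Tokenize the comments and convert each one to a sequence of integer word indices, keeping only the `max_features` most frequent words.

(2) Pad or truncate every sequence to `maxlen` tokens so that all inputs have the same length.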
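
The two steps above can be sketched in plain Python to show what they do to the text. This is a simplified illustration, not the Keras implementation: the helpers `fit_word_index`, `to_sequences` and `pad_left` are hypothetical names, and splitting on whitespace stands in for the Keras tokenizer, which also strips punctuation. The sketch does follow the Keras defaults of reserving index 0 for padding, dropping out-of-vocabulary words, and padding and truncating at the front of the sequence.

```python
from collections import Counter


def fit_word_index(texts, num_words):
    """Rank words by frequency; index 1 is the most frequent word."""
    counts = Counter(word for t in texts for word in t.lower().split())
    ranked = [w for w, _ in counts.most_common()]
    # Index 0 is reserved for padding, so only num_words - 1 words are kept.
    return {w: i for i, w in enumerate(ranked, start=1) if i < num_words}


def to_sequences(texts, word_index):
    """Replace each known word with its index; unknown words are dropped."""
    return [[word_index[w] for w in t.lower().split() if w in word_index]
            for t in texts]


def pad_left(seqs, maxlen):
    """Left-pad short sequences with zeros; keep the last maxlen tokens of long ones."""
    return [[0] * max(0, maxlen - len(s)) + s[-maxlen:] for s in seqs]


# Made-up toy corpus.
texts = ["you are great", "you are a troll you troll"]
word_index = fit_word_index(texts, num_words=5)
sequences = to_sequences(texts, word_index)
padded = pad_left(sequences, maxlen=4)
print(padded)  # [[0, 1, 2, 4], [2, 3, 1, 3]]
```

The short comment is padded with a leading zero, while the long one loses its first token, so with front truncation it is the end of a long comment that reaches the model.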

In [None]:
tokenizer = text.Tokenizer(num_words=max_features)

In [None]:
tokenizer.fit_on_texts(list(list_sentences_train))

In [None]:
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)

In [None]:
list_tokenized_test = tokenizer.texts_to_sequences(list_sentences_test)

In [None]:
X_t = sequence.pad_sequences(list_tokenized_train, maxlen=maxlen)

In [None]:
X_te = sequence.pad_sequences(list_tokenized_test, maxlen=maxlen)

### Model Definition

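The model starts with an Embedding layer that maps each word index to a dense 128-dimensional vector. A bidirectional LSTM layer then reads the sequence in both directions to capture word order and context, and global max pooling collapses its per-token outputs into a single vector. After a small dense layer with dropout, a final Dense layer with six sigmoid units outputs an independent probability for each label.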
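
To make the role of the sigmoid output concrete, here is a small NumPy sketch of how six independent sigmoid units and a binary cross-entropy loss behave on one comment. The logits and labels are made-up values, and `sigmoid` and `binary_crossentropy` are illustrative helpers rather than the Keras functions themselves.

```python
import numpy as np


def sigmoid(z):
    """Squash each logit independently into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of the per-label binary cross-entropy terms."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    per_label = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(per_label.mean())


# Made-up logits for one comment, one value per label.
logits = np.array([[2.0, -3.0, 1.5, -4.0, 1.0, -2.5]])
probs = sigmoid(logits)

# Made-up ground truth: the comment carries three of the six labels.
y_true = np.array([[1, 0, 1, 0, 1, 0]])

loss = binary_crossentropy(y_true, probs)
predicted = (probs > 0.5).astype(int)
print(probs.round(3), predicted, round(loss, 4))
```

Unlike a softmax, the six probabilities do not have to sum to 1, which is why several labels can be predicted for the same comment.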

In [None]:
def get_model():
    embed_size = 128
    inp = Input(shape=(maxlen, ))
    x = Embedding(max_features, embed_size)(inp)
    x = Bidirectional(LSTM(50, return_sequences=True))(x)
    x = GlobalMaxPool1D()(x)
    x = Dropout(0.1)(x)
    x = Dense(50, activation="relu")(x)
    x = Dropout(0.1)(x)
    x = Dense(6, activation="sigmoid")(x)
    model = Model(inputs=inp, outputs=x)
    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])

    return model

In [None]:
model = get_model()
batch_size = 32
epochs = 20

model.summary()

In [None]:
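file_path = "weights_base.best.hdf5"

# Save the weights whenever the validation loss improves.
checkpoint = ModelCheckpoint(file_path, monitor='val_loss', verbose=1, save_best_only=True, mode='min')

# Note: with patience=20 and epochs=20 early stopping can never trigger;
# lower patience (for example to 2 or 3) to actually stop early.
early = EarlyStopping(monitor="val_loss", mode="min", patience=20)

callbacks_list = [checkpoint, early]
# Hold out the last 10% of the training data for validation.
model.fit(X_t, y, batch_size=batch_size, epochs=epochs, validation_split=0.1,
          callbacks=callbacks_list)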


### Model Inference

In [None]:
model.load_weights(file_path)

In [None]:
y_test = model.predict(X_te)

In [None]:
sample_submission = pd.read_csv("./input/sample_submission.csv")

In [None]:
sample_submission[list_classes] = y_test

In [None]:
sample_submission.to_csv("baseline.csv", index=False)