# CNN-GRU Model
Building a simple baseline model on the toxic comments dataset to explore multi-label classification.

This model applies theory learned from the excellent deep learning work of fchollet and johkhron, which inspired me to explore various deep learning architectures and test them on different datasets.

The baseline model is developed to show what is achievable on this particular dataset, to lay the foundations for improvement, and to motivate more complex architectures.

The initial model was a 1D CNN with a single GRU layer trained on the GloVe 100D word embeddings, reaching roughly 96% training and validation accuracy. This notebook keeps the same architecture and swaps in the 200D word embeddings (which are easy to find on Kaggle) to show the modest improvement gained simply from larger word vectors.

In this notebook we discuss data preprocessing and model training, and later apply techniques such as Dropout and Batch Normalization to further improve model generalization.

As my own knowledge develops I aim to apply it in this notebook, both to track progress and to invite comments and discussion from other Kaggle members.


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In [None]:
import keras

In [None]:
# verify GPU acceleration
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

## Custom helper functions
Functions for reading in embedding data and plotting model accuracy

In [None]:
# functions for reading in embedding data and
# tokenizing and processing sequences with padding and
# function for plotting model accuracy and loss

import os
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# vectorizer and sequence function
# takes in raw text and labels
# params for max sequence length and max words
# default arg for Shuffle=True to randomise data
# returns tokenizer object. x_train,y_train, x_val,y_val
def tokenize_and_sequence(full_data_set, texts, labels, max_len, max_words, validation_samples, shuffle=True):
    #initialise tokenizer with num_words param
    tokenizer = Tokenizer(num_words=max_words)
    tokenizer.fit_on_texts(full_data_set)
    # convert texts to sequences
    sequences = tokenizer.texts_to_sequences(texts)
    # generate word index
    word_index = tokenizer.word_index
    # print top words count
    print('{} of unique tokens found'.format(len(word_index)))
    # pad sequences using max_len param
    data = pad_sequences(sequences, maxlen=max_len)
    # convert list of labels into numpy array
    labels = np.asarray(labels)
    # print shape of text and label tensors
    print('data tensor shape: {}\nlabel tensor shape:{}'.format(data.shape, labels.shape))

    # shuffle data=True as labels are ordered
    # randomise data to vary class distribution
    if shuffle:
        # get length of data sequence and create array
        indices = np.arange(data.shape[0])
        np.random.shuffle(indices)
        # shuffle data and labels
        data = data[indices]
        labels = labels[indices]
    else:
        pass

    # split training data into training and validation splits
    # split using validation length
    # validation split
    x_val = data[:validation_samples]
    y_val = labels[:validation_samples]
    # training split
    x_train = data[validation_samples:]
    y_train = labels[validation_samples:]

    # return tokenizer, word_index, training and validation data
    return tokenizer, word_index, x_train, y_train, x_val, y_val


# function to load pretrained glove embeddings
# takes in embedding dim for variable embedding sizes
# and base directory as well as txt file
# embedding dim should match the file name dimension
# and max words and word_index for embedding features
def load_glove(base_directory, f_name, max_words, word_index, embedding_dim=None):
    # check file name ends in .txt
    # read embedding dimension from the file name if not specified
    if f_name[-4:] == '.txt':
        # check embedding value
        if embedding_dim is None:
            # e.g. 'glove.6B.200d.txt' -> '200' (assumes a three-digit dimension)
            dim = f_name[-8:-5]
            dim = int(dim)
            embedding_dim = dim
        else:
            # dimension was passed in manually, use it as-is
            pass
        # continue

        # create embedding dictionary
        embeddings_index = {}
        # open embeddings file
        try:
            # GloVe files are UTF-8 text; the with-block closes the file for us
            with open(os.path.join(base_directory, f_name), encoding='utf-8') as f:
                # each line holds a word followed by its coefficients
                # map each word to its coefficient vector in the embeddings dictionary
                for line in f:
                    values = line.split()  # returns [word, coeff_1, ..., coeff_n]
                    word = values[0]  # gets first list element
                    coeff = np.asarray(values[1:], dtype='float32')  # remaining values are the coefficients
                    # assign mapping to dictionary
                    embeddings_index[word] = coeff
        except IOError:
            print('cannot read file. check file paths')

        # prepare glove word-embedding matrix
        # create empty embedding tensor
        embedding_matrix = np.zeros((max_words,embedding_dim ))
        # map the top words of the data into the glove embedding matrix
        # words not found from the data in glove will be zeroed
        for word, i in word_index.items():
            if i < max_words:
                embedding_vector = embeddings_index.get(word)
                if embedding_vector is not None:
                    embedding_matrix[i] = embedding_vector

        # return embedding matrix
        return embedding_matrix


# function to visualise keras model history metrics
# function takes in acc, val_acc, loss, val_loss for model params
# range is defined by epochs in range len(acc)

import matplotlib.pyplot as plt

def plot_training_and_validation(acc, val_acc, loss, val_loss):
    epochs = range(1, len(acc) + 1)
    plt.plot(epochs, acc, 'bo', label='Training acc')
    plt.plot(epochs, val_acc, 'b', label='Validation acc')
    plt.title('Training and validation accuracy')
    plt.legend()
    plt.figure()
    plt.plot(epochs, loss, 'bo', label='Training loss')
    plt.plot(epochs, val_loss, 'b', label='Validation loss')
    plt.title('Training and validation loss')
    plt.legend()
    plt.show()

# end
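
To make the lookup inside `load_glove` concrete, here is a minimal sketch on a toy vocabulary. The three-dimensional vectors and every `toy_` name are made up for illustration; a real GloVe file holds one word plus its coefficients per line.

```python
import numpy as np

# Hypothetical 3-dimensional "pretrained" vectors standing in for a GloVe file.
toy_embeddings_index = {
    'the': np.array([0.1, 0.2, 0.3], dtype='float32'),
    'cat': np.array([0.4, 0.5, 0.6], dtype='float32'),
}
# Word index in the style of a Keras Tokenizer: 1-based, most frequent first.
toy_word_index = {'the': 1, 'cat': 2, 'zzyzx': 3, 'rare': 4}
toy_max_words = 4  # only rows 0..3 exist in the matrix
toy_dim = 3

toy_matrix = np.zeros((toy_max_words, toy_dim))
for word, i in toy_word_index.items():
    if i < toy_max_words:                      # 'rare' (index 4) is skipped
        vector = toy_embeddings_index.get(word)
        if vector is not None:                 # 'zzyzx' has no vector, row stays zero
            toy_matrix[i] = vector

print(toy_matrix)
```

Row 0 stays zero because index 0 is reserved for padding, and any word that is missing from the embeddings file also keeps an all-zero row.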

# 1 - Baseline CNN-GRU Model


# 1.1 - Loading Data
In this section we load the raw data into a Pandas DataFrame which will help in isolating the comments and labels into lists of arrays, ready for vectorization in 1.2

In [None]:
import pandas as pd

In [None]:
# set file paths for training and test files
# base directory
toxic_base = '../input/jigsaw-toxic-comment-classification-challenge/'
toxic_train = toxic_base + 'train.csv'
toxic_test = toxic_base + 'test.csv'
# load train and test data into DataFrames
train_df = pd.read_csv(toxic_train)
test_df = pd.read_csv(toxic_test)

In [None]:
# verify and inspect data
# training data
train_df.head()

In [None]:
print(train_df.shape)

In [None]:
# test data
test_df.head()

In [None]:
test_df.shape

In [None]:
# extract label columns
cols = list(train_df.columns)
print(type(cols))
print(list(cols))

In [None]:
# remove id and comment_text columns
cols = cols[2:]
print(cols)

In [None]:
# describe data and check for any null or missing values, and the spread of labels
train_df.describe()

In [None]:
# check for null values in labels in training data
train_df.isnull().any()

In [None]:
# check for null values in test data
test_df.isnull().any()

In [None]:
# view spread of classes across the training data
for label in cols:
    print('label: {}'.format(label))
    for x in [0, 1]:
        print('value: {}, total:{}'.format(x, train_df[label].eq(x).sum()))


# 1.2 Split data into text and labels
Split training data into text and label arrays as well as test text

In [None]:
# split class labels into a y array
y_train = train_df[cols].values

In [None]:
# verify labels are arrays of 6 values
y_train[0]

In [None]:
# convert to numpy array
y_train = np.asarray(y_train)

In [None]:
# verify matrix shape
y_train.shape

In [None]:
# split comment_text from training and test data
comment = 'comment_text'
# extract Series object from DataFrames for train and test 
sentences_train = train_df[comment]
sentences_test = test_df[comment]

In [None]:
# transform Series into list of text values
x_train = list(sentences_train)
x_test = list(sentences_test)

In [None]:
# verify train and test text
print(x_train[0], '\n')
print(x_test[0])

# 1.3 Vectorize training and test data
Using the custom functions to load in raw text and label data and vectorize them based on a set of parameters.

The parameters are max_len (the maximum sequence length of each text), max_words (the number of words kept by the tokenizer) and validation_samples (the size of the validation split).

The baseline model will use a max_len of 100, max_words of 10,000 and a validation split of 10% of the training data
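
As a rough illustration of what fixing the sequence length does, the sketch below pads and truncates toy sequences by hand. It assumes the Keras `pad_sequences` defaults (`padding='pre'`, `truncating='pre'`); `pad_pre` is a hypothetical helper written for this example and is not part of Keras.

```python
import numpy as np

def pad_pre(sequences, maxlen):
    """Left-pad with zeros and keep only the last `maxlen` tokens."""
    out = np.zeros((len(sequences), maxlen), dtype='int32')
    for row, seq in enumerate(sequences):
        trimmed = seq[-maxlen:]                  # drop the earliest tokens if too long
        if trimmed:
            out[row, -len(trimmed):] = trimmed   # zeros end up in front
    return out

toy_sequences = [[5, 2], [7, 1, 9, 4, 3, 8]]
padded = pad_pre(toy_sequences, maxlen=4)
print(padded)
```

With these defaults a long comment keeps its ending rather than its beginning, which is worth bearing in mind when choosing max_len.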

In [None]:
# check training data sample size
print(len(x_train))

In [None]:
# get the value of 10% of the training data
print(len(x_train) // 10)

In [None]:
validation_samples = int((len(x_train) // 10))

In [None]:
validation_samples

In [None]:
# define max sequence length and total dictionary words
max_len = 100
max_words = 10000

In [None]:
# Vectorize training data and return tokenizer and word_index as well as validation splits
# the first argument is the text the tokenizer is fitted on (training text only for the baseline)
tokenizer, word_index, X_train, Y_train, x_val, y_val = tokenize_and_sequence(
    x_train, x_train, y_train, max_len=max_len, max_words=max_words,
    validation_samples=validation_samples, shuffle=False)

In [None]:
# verify train and validation text and labels
print('training:',X_train.shape, Y_train.shape, '\nvalidation:', x_val.shape, y_val.shape)

# 1.4 Load pre-trained GloVe word embeddings
Loading in the 200D word embeddings to establish a baseline model

In [None]:
# define directory paths for glove embeddings
glove_dir = '../input/glove6b200d/'

In [None]:
# glove file name
glove_file = 'glove.6B.200d.txt'

In [None]:
# define embedding dimension value to match the glove200d file
embedding_dim = 200

In [None]:
# load in glove embedding using custom function from earlier
# function takes as input the raw file, word_index returned from the tokenizer and max_words
glove_embedding_200d = load_glove(
    glove_dir, glove_file, max_words=max_words, word_index=word_index, embedding_dim=embedding_dim)

In [None]:
# verify embeddings loaded correctly
glove_embedding_200d.shape

# 2. Model Architectures


# 2.1 CNN-GRU Baseline
Design a simple model using a 1D convolution layer followed by a GRU layer to establish a benchmark performance score for further improvements.

The pre-trained embeddings are loaded with their weights frozen to prevent the word vectors from being re-trained during model training.

As this is a multi-label problem, where each of the six classes is predicted independently, loss is calculated using binary crossentropy with an adam optimizer. Further experiments with hyperparameter tuning will explore how various optimizers perform on this dataset.

Training for 5 epochs on batch sizes of 32
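
To show why binary crossentropy fits a sigmoid output with six independent labels, here is a small NumPy sketch of the loss. It is a simplified stand-in for what Keras computes, and the toy labels and predictions are invented for illustration.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of the per-label log loss; each label is scored on its own."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    per_label = -(y_true * np.log(y_pred)
                  + (1.0 - y_true) * np.log(1.0 - y_pred))
    return per_label.mean()

# Two toy comments with six labels each; the first carries two labels at once.
toy_true = np.array([[1, 0, 1, 0, 0, 0],
                     [0, 0, 0, 0, 0, 0]], dtype='float32')
toy_pred = np.array([[0.9, 0.1, 0.8, 0.1, 0.1, 0.1],
                     [0.1, 0.1, 0.1, 0.1, 0.1, 0.1]], dtype='float32')

print(binary_crossentropy(toy_true, toy_pred))
```

A softmax with categorical crossentropy would force the six probabilities to sum to one, which does not suit comments that can carry several labels or none.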

## Design
A sequential model with the following layers:
- Embedding(dimension=200)
- Conv1D(64, 3, 'relu') *64 filters with a kernel size of 3, can be extended up to 7*
- MaxPooling1D(4) *standard practice following convolutions*
- GRU(64, dropout=0.1, recurrent_dropout=0.5) *using a layer dropout of 10% and a recurrent dropout of 50%, as seen in research to return good performance*
- Dense(6, activation='sigmoid') *dense classifier layer of six outputs*
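
The layer list above implies a fixed number of timesteps reaching the GRU. The sketch below works through that arithmetic, assuming the Keras defaults of `padding='valid'` and a stride of 1 for Conv1D, and a stride equal to the pool size for MaxPooling1D. Both helper functions are hypothetical and exist only for this example.

```python
def conv1d_out_len(length, kernel_size):
    # 'valid' padding with stride 1
    return length - kernel_size + 1

def maxpool1d_out_len(length, pool_size):
    # 'valid' padding with stride equal to pool_size
    return (length - pool_size) // pool_size + 1

seq_len = 100                                   # max_len used by the baseline
after_conv = conv1d_out_len(seq_len, 3)
after_pool = maxpool1d_out_len(after_conv, 4)
print(after_conv, after_pool)                   # timesteps after Conv1D and after pooling
```

These values should agree with the output shapes that `model.summary()` reports for the convolution and pooling layers.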



In [None]:
# import layers
from keras.layers import Input, Embedding, GRU, LSTM, MaxPooling1D, GlobalMaxPool1D
from keras.layers import Dropout, Dense, Activation, Flatten,Conv1D, SpatialDropout1D
from keras.models import Sequential
from keras.optimizers import RMSprop 

In [None]:
# import AUC ROC metrics from sklearn
from sklearn.metrics import roc_auc_score

In [None]:
# define model architecture
model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=max_len))
model.add(Conv1D(64, 3, activation='relu'))
model.add(MaxPooling1D(4))
model.add(GRU(64, dropout=0.1, recurrent_dropout=0.5)) # defaults include tanh activation
model.add(Dense(6, activation='sigmoid'))
model.summary()

In [None]:
# load pre-trained Glove embeddings in the first layer
model.layers[0].set_weights([glove_embedding_200d])
# freeze embedding layer weights
model.layers[0].trainable = False
# compile model with adam optimizer
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [None]:
# fit model and train on training data and validate on validation samples
# train for 5 epochs to establish baseline overfitting model
# saves results to history object
history = model.fit(X_train, Y_train, epochs=5, batch_size=32, validation_data=(x_val, y_val))

In [None]:
# save model
model.save('cnn_gru_200d.h5')

## 2.1.1 Model Results
Plot model training and validation performance


In [None]:
# define plotting metrics
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
# plot model training and validation accuracy and loss
plot_training_and_validation(acc, val_acc, loss, val_loss)

Let's use the AUC ROC score metrics for a slightly better understanding of model performance.

The training and validation performance remains consistent but let's review it on the validation data and plot the AUC graph for a better representation of model generalization
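
To illustrate why accuracy alone can flatter a model on rare labels, the sketch below scores a predictor that never flags anything. The label column is a toy with 5% positives, and `auc_single` is a hypothetical pairwise helper for a single label; for multi-label input, sklearn's `roc_auc_score` averages the per-label scores by default.

```python
import numpy as np

def auc_single(y_true, y_score):
    """ROC AUC for one label: chance that a positive outranks a negative."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]          # every positive/negative pair
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))

# Toy label column: 5 positives out of 100 samples.
toy_labels = np.array([1] * 5 + [0] * 95)
all_zero_scores = np.zeros(100)                 # a model that predicts 0 everywhere

accuracy = np.mean((all_zero_scores > 0.5) == toy_labels)
auc = auc_single(toy_labels, all_zero_scores)
print(accuracy, auc)
```

The all-zero predictor looks strong on accuracy yet ranks positives no better than chance, which is why the ROC AUC is the more informative number here.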

In [None]:
y_hat = model.predict(x_val)

In [None]:
# print auc roc score
"{:0.2f}".format(roc_auc_score(y_val, y_hat)*100.0)

# Conclusion
With an auc roc score of a modest 96.19%, we have established a strong baseline model that uses a 1D convolution and a GRU layer. 

As a first experiment testing this type of architecture on a multi-label classification task with natural language data, it shows the effectiveness of using a convolution base with larger (over 100d) pre-trained word embeddings along with a less compute-intensive recurrent layer such as the GRU.

With the leader achieving an accuracy of 98.85%, further enhancements to this model such as dropout, batch normalisation, increased layers and units, as well as hyper-parameter tuning could lead to performance close to or matching those scores.

As this is my first kernel on Kaggle, I hope my code and notebook is a good introduction to approaching text based classification problems.


# 2.2 Baseline Improvement

## Experiment with model architecture, model and data complexity 
Having established a reasonable baseline model using a simple CNN-GRU architecture, we take the learning from our previous experiment and add more complex features as well as modify input data. 

The main improvements we will add are:
- Larger vectors
- Shorter sequences
- Vectorize on both training and test data
- Callbacks: Early stopping, Reduce LR, ROC per epoch
- Larger convolution kernels
- Dropout
- Batch Normalization

Extended modifications we can test:
- Larger units
- More Layers
- Stacked recurrent layers
- Bidirectional recurrent layer



## 2.2.1 Vectorize input data
In this experiment we fit the Tokenizer on both the training and test data, to capture as many relevant words as possible before padding the sequences and extracting their relevant word embeddings.

The process is identical to Section 1.3
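
The effect of fitting on more text, and of the word cap, can be sketched without Keras. The helpers below are hypothetical simplifications: they split on whitespace only, whereas the Keras Tokenizer also strips punctuation. They do mimic the rule that words ranked at or beyond the cap are dropped from the sequences.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; rank 1 is the most frequent, 0 is padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words):
    """Keep only known words whose rank is below num_words."""
    return [[word_index[w] for w in text.lower().split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]

toy_train = ['you are great', 'you are you']
toy_test = ['great answer']

index_train_only = build_word_index(toy_train)
index_train_test = build_word_index(toy_train + toy_test)

print(to_sequences(toy_test, index_train_only, num_words=10))  # 'answer' is unknown
print(to_sequences(toy_test, index_train_test, num_words=10))  # 'answer' now has an index
```

A word seen only in the test set still needs a GloVe vector to be useful, since the frozen embedding layer never learns one for it.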

In [None]:
# verify length of training and test data
print(len(x_train), len(x_test))

In [None]:
# create validation split of 10%
validation_split = len(x_train) // 10
print(validation_split)

### Define sequence length and max number of words
Here we reduce the sequence length to 50, as our EDA showed that most of our samples are fewer than 60 tokens in length.

We also increase the total number of words to 20,000 for the same intended effect, for larger coverage.
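
The EDA itself is not shown in this notebook, so the following is only a sketch of how such a length check might look. The comments are invented, and in the notebook the list would be `x_train`.

```python
import numpy as np

# Hypothetical comments standing in for x_train.
toy_comments = [
    'thanks for the edit',
    'please stop vandalising this page',
    'ok',
    'this is a much longer comment that keeps going well past the others in the sample',
]

# Token count per comment, using a plain whitespace split.
lengths = np.array([len(text.split()) for text in toy_comments])
for q in (50, 90, 100):
    print('{}th percentile: {:.0f} tokens'.format(q, np.percentile(lengths, q)))

# Share of comments that a given cap would truncate.
toy_max_len = 5
share_truncated = np.mean(lengths > toy_max_len)
print('truncated at {}: {:.0%}'.format(toy_max_len, share_truncated))
```

Choosing max_len near a high percentile keeps most comments intact while keeping the padded tensor small.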


In [None]:
# define max_len and max_words
max_len = 50
max_words = 20000

In [None]:
full_tokenized = x_train + x_test

In [None]:
# repeat vectorization process using our custom function
# Vectorize training data and return tokenizer and word_index as well as validation splits
# tokenize on x_train AND x_test using new parameter to capture as many words as possible
tokenizer, word_index, X_train, Y_train, x_val, y_val = tokenize_and_sequence(full_tokenized,
    x_train, y_train, max_len=max_len, max_words=max_words, 
    validation_samples=validation_split, shuffle=False)

## 2.2.2 Load pre-trained GloVe embeddings

In [None]:
# define directory paths for glove embeddings
glove_dir = '../input/glove6b200d/'

In [None]:
# glove file name
glove_file = 'glove.6B.200d.txt'

In [None]:
# define embedding dimension value to match the glove200d file
embedding_dim = 200

In [None]:
# load in glove embedding using custom function from earlier
# function takes as input the raw file, word_index returned from the tokenizer and max_words
glove_embedding_200d = load_glove(
    glove_dir, glove_file, max_words=max_words, word_index=word_index, embedding_dim=embedding_dim)

In [None]:
# verify embeddings loaded correctly
glove_embedding_200d.shape

## 2.2.3 Baseline++ Model
We take the same core design of the baseline model of a CNN-GRU but add features such as Dropout, Batch normalization. 

We optimise the model with callbacks and monitor progress to compare if the modifications in input sequence length and size improves our results.


In [None]:
# import AUC ROC metrics from sklearn
from sklearn.metrics import roc_auc_score

# define class for ROC AUC callback with simple name modifications
# credit to https://www.kaggle.com/yekenot
class roc_auc_validation(keras.callbacks.Callback):
    def __init__(self, validation_data=(), interval=1):
        # initialise the parent Callback via this class, not the bare name `Callback`
        super(roc_auc_validation, self).__init__()
        self.interval = interval
        self.x_val, self.y_val = validation_data

    def on_epoch_end(self, epoch, logs=None):
        if epoch % self.interval == 0:
            y_pred = self.model.predict(self.x_val, verbose=0)
            score = roc_auc_score(self.y_val, y_pred)
            print("\n ROC-AUC - epoch: {:d} - score: {:.6f}".format(epoch+1, score))


In [None]:
# import keras layers 
from keras.layers import Input, Embedding, GRU, LSTM, MaxPooling1D, GlobalMaxPool1D, CuDNNGRU, CuDNNLSTM
from keras.layers import Dropout, Dense, Activation, Flatten,Conv1D, Bidirectional, SpatialDropout1D, BatchNormalization
from keras.models import Sequential
from keras.optimizers import RMSprop 

In [None]:
# define model architecture
model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=max_len))
model.add(SpatialDropout1D(0.2)) # add spatial dropout
model.add(Conv1D(64, 3, activation='relu')) # kernel size of 3 (a kernel size of 5 was also tested)
model.add(MaxPooling1D(4))
model.add(BatchNormalization()) # add batch normalization
model.add(Dropout(0.1))
# modify to CuDNNGRU
#model.add(GRU(64, dropout=0.1, recurrent_dropout=0.5)) # defaults include tanh activation
model.add(CuDNNGRU(64)) # does not have a dropout or recurrent dropout param
model.add(BatchNormalization())
model.add(Dropout(0.1))
model.add(Dense(6, activation='sigmoid'))
model.summary()

### 2.2.3.1 Increasing Baseline++ complexity
The performance of our Baseline++ remains constant at around 98% for 20 epochs without overfitting, suggesting there is still room for improvement. 

In this subsection we increase the units of both the CNN and GRU layers and test the model once again. We also change the callback parameters to stop early by monitoring val_loss.

Results = Early stopping at epoch 7 with accuracy again at 98%.

Experiment 2 - Add a second convolution layer and change kernels to 7, 5 in each
Results = Early stopping at epoch 5 with accuracy at 97%

Experiment 3 - Return to ++ model with a Bidirectional GRU layer instead of a single one
Results = Epoch 18/20
143614/143614 [==============================] - 21s 144us/step - loss: 0.0403 - acc: 0.9849 - val_loss: 0.0733 - val_acc: 0.9767

 ROC-AUC - epoch: 18 - score: 0.949673
Epoch 19/20
143614/143614 [==============================] - 21s 143us/step - loss: 0.0396 - acc: 0.9849 - val_loss: 0.0730 - val_acc: 0.9767

Experiment 4 - Test this model using rmsprop optimizer
Results = highest training accuracy of 98.6% but early stopping at epoch 4, with ROC of 95%

Experiment 5 - Test again using 128 batch size
Results = Early stopping at epoch 2
Conclusion = adam optimiser is better for this type of problem

Experiment 6 - Add a second convolution layer, 128, 64, same parameters
Results = more compute, weaker ROC at 93, training at 97

Experiment 7 - Return to baseline++ of 64,64 cnn,gru, reduce filter size to 3 as this is a small sequential data set
Results = Highest ROC so far, starting at 95% in epoch 1, rising to 97% and peaking at 98% around epoch 10. Training accuracy is constant at around 98%, as is validation accuracy at up to 98%. This is the best result I have observed so far. A shorter convolution kernel of 3, with 64 filters and a 64 unit GRU, is best suited to this data set, which makes sense considering we are using a sequence size of 50 and the text sequences themselves are relatively short. This also shows that a simpler network architecture works better here than adding more layers or branches.


In [None]:
# Baseline++ with more units
# this model performs the best with the highest training accuracy of loss: 0.0378 - acc: 0.9857 
# - val_loss: 0.0763 - val_acc: 0.9759 # ROC-AUC - epoch: 20 - score: 0.951627
model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=max_len))
model.add(SpatialDropout1D(0.2)) # add spatial dropout
model.add(Conv1D(128, 5, activation='relu')) # increase kernel size to 5
model.add(MaxPooling1D(4))
model.add(BatchNormalization()) # add batch normalization
model.add(Dropout(0.1))

# modify to CuDNNGRU and double units to 128
#model.add(GRU(64, dropout=0.1, recurrent_dropout=0.5)) # defaults include tanh activation
model.add(Bidirectional(CuDNNGRU(128))) # does not have a dropout or recurrent dropout param
model.add(BatchNormalization())
model.add(Dropout(0.1))
model.add(Dense(6, activation='sigmoid'))
model.summary()

In [None]:
# Baseline++ with 2 CNN layers
# less performant compared to baseline++
model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=max_len))
model.add(SpatialDropout1D(0.2)) # add spatial dropout
# CNN layers
model.add(Conv1D(256, 5, activation='relu')) # increase kernel size to 5
model.add(MaxPooling1D(4))
model.add(BatchNormalization()) # add batch normalization
model.add(Dropout(0.1))
# second CNN 
model.add(Conv1D(128, 7, activation='relu')) # kernel size of 7 in the second convolution
model.add(MaxPooling1D(4))
model.add(BatchNormalization()) # add batch normalization
model.add(Dropout(0.1))
# modify to a bidirectional CuDNNGRU with 64 units
#model.add(GRU(64, dropout=0.1, recurrent_dropout=0.5)) # defaults include tanh activation
model.add(Bidirectional(CuDNNGRU(64))) # does not have a dropout or recurrent dropout param
model.add(BatchNormalization())
model.add(Dropout(0.1))
model.add(Dense(6, activation='sigmoid'))
model.summary()

In [None]:
# highest performing architecture
# baseline++ with shorter convolution kernel of 3
# define model architecture
model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=max_len))
model.add(SpatialDropout1D(0.2)) # add spatial dropout
model.add(Conv1D(64, 3, activation='relu')) # kernel size of 3, the best performing setting
model.add(MaxPooling1D(4))
model.add(BatchNormalization()) # add batch normalization
model.add(Dropout(0.1))
# modify to CuDNNGRU
#model.add(GRU(64, dropout=0.1, recurrent_dropout=0.5)) # defaults include tanh activation
model.add(CuDNNGRU(64)) # does not have a dropout or recurrent dropout param
model.add(BatchNormalization())
model.add(Dropout(0.1))
model.add(Dense(6, activation='sigmoid'))
model.summary()

# best performing model training results
Train on 143614 samples, validate on 15957 samples
Epoch 1/20
143614/143614 [==============================] - 10s 72us/step - loss: 0.2060 - acc: 0.9235 - val_loss: 0.0596 - val_acc: 0.9798

 ROC-AUC - epoch: 1 - score: 0.959088
Epoch 2/20
143614/143614 [==============================] - 8s 56us/step - loss: 0.0643 - acc: 0.9785 - val_loss: 0.0536 - val_acc: 0.9810

 ROC-AUC - epoch: 2 - score: 0.972622
Epoch 3/20
143614/143614 [==============================] - 8s 56us/step - loss: 0.0590 - acc: 0.9795 - val_loss: 0.0524 - val_acc: 0.9814

 ROC-AUC - epoch: 3 - score: 0.974925
Epoch 4/20
143614/143614 [==============================] - 8s 56us/step - loss: 0.0554 - acc: 0.9806 - val_loss: 0.0517 - val_acc: 0.9813

 ROC-AUC - epoch: 4 - score: 0.976720
Epoch 5/20
143614/143614 [==============================] - 8s 56us/step - loss: 0.0535 - acc: 0.9810 - val_loss: 0.0522 - val_acc: 0.9808

 ROC-AUC - epoch: 5 - score: 0.976915
Epoch 6/20
143614/143614 [==============================] - 8s 56us/step - loss: 0.0516 - acc: 0.9815 - val_loss: 0.0506 - val_acc: 0.9817

 ROC-AUC - epoch: 6 - score: 0.978048
Epoch 7/20
143614/143614 [==============================] - 8s 56us/step - loss: 0.0504 - acc: 0.9818 - val_loss: 0.0502 - val_acc: 0.9814

 ROC-AUC - epoch: 7 - score: 0.979499
Epoch 8/20
143614/143614 [==============================] - 8s 56us/step - loss: 0.0490 - acc: 0.9821 - val_loss: 0.0502 - val_acc: 0.9816

 ROC-AUC - epoch: 8 - score: 0.979762
Epoch 9/20
143614/143614 [==============================] - 8s 57us/step - loss: 0.0481 - acc: 0.9825 - val_loss: 0.0495 - val_acc: 0.9817

 ROC-AUC - epoch: 9 - score: 0.979418
Epoch 10/20
143614/143614 [==============================] - 8s 57us/step - loss: 0.0472 - acc: 0.9828 - val_loss: 0.0500 - val_acc: 0.9816

 ROC-AUC - epoch: 10 - score: 0.979449
Epoch 11/20
143614/143614 [==============================] - 8s 57us/step - loss: 0.0463 - acc: 0.9829 - val_loss: 0.0497 - val_acc: 0.9818

 ROC-AUC - epoch: 11 - score: 0.980230
Epoch 12/20
143614/143614 [==============================] - 8s 56us/step - loss: 0.0456 - acc: 0.9831 - val_loss: 0.0504 - val_acc: 0.9813

 ROC-AUC - epoch: 12 - score: 0.979920
Epoch 13/20
143614/143614 [==============================] - 8s 56us/step - loss: 0.0451 - acc: 0.9833 - val_loss: 0.0504 - val_acc: 0.9810

 ROC-AUC - epoch: 13 - score: 0.980413
Epoch 14/20
143614/143614 [==============================] - 8s 56us/step - loss: 0.0446 - acc: 0.9835 - val_loss: 0.0502 - val_acc: 0.9810

 ROC-AUC - epoch: 14 - score: 0.980023
Epoch 15/20
143614/143614 [==============================] - 8s 56us/step - loss: 0.0440 - acc: 0.9836 - val_loss: 0.0504 - val_acc: 0.9813

 ROC-AUC - epoch: 15 - score: 0.979886
Epoch 16/20
143614/143614 [==============================] - 8s 56us/step - loss: 0.0435 - acc: 0.9838 - val_loss: 0.0514 - val_acc: 0.9811

 ROC-AUC - epoch: 16 - score: 0.979540
Epoch 17/20
143614/143614 [==============================] - 8s 56us/step - loss: 0.0429 - acc: 0.9839 - val_loss: 0.0508 - val_acc: 0.9817

 ROC-AUC - epoch: 17 - score: 0.979724
Epoch 18/20
143614/143614 [==============================] - 8s 56us/step - loss: 0.0427 - acc: 0.9839 - val_loss: 0.0509 - val_acc: 0.9815

 ROC-AUC - epoch: 18 - score: 0.979505
Epoch 19/20
143614/143614 [==============================] - 8s 56us/step - loss: 0.0423 - acc: 0.9841 - val_loss: 0.0510 - val_acc: 0.9810

 ROC-AUC - epoch: 19 - score: 0.979367
Epoch 20/20
143614/143614 [==============================] - 8s 56us/step - loss: 0.0420 - acc: 0.9842 - val_loss: 0.0511 - val_acc: 0.9809

 ROC-AUC - epoch: 20 - score: 0.979686


In [None]:
# define callbacks
from keras.callbacks import Callback, EarlyStopping, ReduceLROnPlateau

In [None]:
# initialise custom roc callback
roc_callback = roc_auc_validation(validation_data=(x_val, y_val), interval=1)

In [None]:
# define early stopping and model checkpoint callbacks
callback_list = [keras.callbacks.EarlyStopping(monitor='acc', patience=1),
                 keras.callbacks.ModelCheckpoint(filepath='baseline_plus_complex.h5', monitor='val_loss',
                                                 save_best_only=True)]

In [None]:
# add roc to callbacks list
callback_list.append(roc_callback)

In [None]:
callback_list

In [None]:
# load pre-trained Glove embeddings in the first layer
model.layers[0].set_weights([glove_embedding_200d])
# freeze embedding layer weights
model.layers[0].trainable = False
# compile model with adam optimizer
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

Train model with increased batch size and epoch range to let early stopping terminate training
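
The patience rule behind early stopping can be sketched in a few lines. This is a simplified stand-in and not the Keras implementation: `stopping_epoch` is a hypothetical helper, and the values are the first six val_loss figures from the best-performing run logged above.

```python
def stopping_epoch(values, patience, mode='min'):
    """Return the 1-based epoch at which training would stop, or None."""
    best = None
    wait = 0
    for epoch, value in enumerate(values, start=1):
        if mode == 'min':
            improved = best is None or value < best
        else:
            improved = best is None or value > best
        if improved:
            best = value
            wait = 0            # reset the counter on every improvement
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# val_loss for epochs 1 to 6 of the best-performing run
val_loss_log = [0.0596, 0.0536, 0.0524, 0.0517, 0.0522, 0.0506]

print(stopping_epoch(val_loss_log, patience=1))
print(stopping_epoch(val_loss_log, patience=3))
```

Note that `callback_list` as defined above monitors training accuracy (`'acc'`) rather than val_loss, and the logged run completed all 20 epochs. A small patience on val_loss would likely have ended that run much earlier.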

In [None]:
# fit model and train on training data and validate on validation samples
# train for up to 20 epochs; early stopping may end training sooner
# saves results to history object
history = model.fit(X_train, Y_train, epochs=20, batch_size=256, callbacks=callback_list,validation_data=(x_val, y_val))

In [None]:
# tokenize and pad test data
X_test = tokenizer.texts_to_sequences(x_test)
X_test = pad_sequences(X_test, maxlen=max_len)

In [None]:
# verify test data shape
X_test.shape

In [None]:
# test model on submission as reading in test_labels fails
y_hat = model.predict(X_test, batch_size=256)

In [None]:
submission = pd.read_csv('../input/jigsaw-toxic-comment-classification-challenge/sample_submission.csv')

In [None]:
submission[["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]] = y_hat
submission.to_csv('submission_baseline_plus_best_k3.csv', index=False)

## 2.2.4 Conclusion
With a maximum training accuracy of 98.49%, we have reached the peak of the sequential model as it is still essentially a data distillation process. 

In the next experiment we will create a multi-headed model taking the learnings of our experiments so far.

What we have learnt from this notebook is that a simpler model with a CNN and GRU of 64 units each, with dropout and batch normalisation, returns a test score of 96%, while a more complex model actually suffers by 3% on the test set. For this particular data set we may benefit from model ensembling to gain those last couple of accuracy points, but it is clear that simpler models capture the features of this dataset better than more complex sequential models. There may also be greater performance to be gained by using larger dimensional word embeddings, as well as some very limited data preprocessing such as removing numbers and punctuation.

In conclusion, this has been a good start to experimenting with multi-label multi-class classification and shows that a simple but well thought out model architecture paired with pre-trained word embeddings can lead to very good results, and sets the foundation for similar text classification problems.

Edit:

Having run more experiments, the best performing network is the baseline++ with a kernel size of 3, consistently achieving the highest ROC AUC of 97-98%, up from 95% in previous tests. Training and validation accuracy also remain constant at around 98%, which shows that simplifying the architecture to suit the nature of the input data yields better training and confidence.

