# Table of Contents
1. [Introduction](#1)
2. [Import Libraries and Datasets](#2)
3. [Training Data Pre-Processing](#3)
4. [Pre-trained Word Embeddings](#3)
5. [Convolutional Neural Network Model](#4)
 - 5.1 [Word2Vec Embeddings](#5)
 - 5.2 [GloVe Embeddings](#6)
 - 5.3 [FastText Embeddings](#7)
6. [Conclusions](#8)

## 1. Introduction

This kernel shows multiple Deep Learning models for text classification. Words are vectorized using several word embeddings (Word2Vec, GloVe and FastText), both out-of-the-box and with additional training on the task's specific data. The models are based on different architectures involving Convolutional and Recurrent Neural Networks.

The notebook follows these steps:

 - Load train and test datasets
 - Feature extraction/vectorization of corpus with Word Embeddings
 - Training of multi-label classification DL models for toxicity level and type
 - Model hyperparameter tuning
 - Performance metrics and DL models comparison

## 2. Import Libraries and Datasets

In [None]:
import sys, os, gc
import zipfile
import numpy as np, pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model
from keras.layers import Input, Embedding, Dense, Conv2D, MaxPool2D
from keras.layers import Reshape, Flatten, Concatenate, Dropout, SpatialDropout1D
from keras.preprocessing import text, sequence
from keras.callbacks import Callback
import gensim.models.keyedvectors as word2vec

In [None]:
# unzip a dataset archive into the specified output path
def import_zipped_data(file, output_path):
    with zipfile.ZipFile("../input/jigsaw-toxic-comment-classification-challenge/"+file+".zip","r") as z:
        z.extractall(output_path)

datasets = ['train.csv', 'test.csv', 'test_labels.csv', 'sample_submission.csv']

kaggle_home = '/kaggle/working'
for dataset in datasets:
    import_zipped_data(dataset, output_path = kaggle_home)

In [None]:
test_df = pd.read_csv('/kaggle/working/test.csv')
train_df = pd.read_csv('/kaggle/working/train.csv')
sample_input = pd.read_csv('/kaggle/working/sample_submission.csv')
test_labels = pd.read_csv('/kaggle/working/test_labels.csv')

In [None]:
train_df.head()

## 3. Training Data Pre-processing

In [None]:
TEXT = 'comment_text'
labels = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
# add label to mark non-toxic comments
train_df['non-toxic'] = 1 - train_df[labels].max(axis=1)
# replace na values with placeholder (assign back instead of using inplace on a column selection)
train_df[TEXT] = train_df[TEXT].fillna("unknown")
test_df[TEXT] = test_df[TEXT].fillna("unknown")

In [None]:
# isolate classification labels and input text
y = train_df[labels]
list_sentences_train = train_df[TEXT]
list_sentences_test = test_df[TEXT]

In [None]:
max_features, maxlen = 20000, 200
# tokenize training and test data
tk = Tokenizer(num_words=max_features)
tk.fit_on_texts(list(list_sentences_train))
list_tokenized_train = tk.texts_to_sequences(list_sentences_train)
list_tokenized_test = tk.texts_to_sequences(list_sentences_test)
# pad sequences for homogeneous length
X_train = pad_sequences(list_tokenized_train, maxlen=maxlen)
X_test = pad_sequences(list_tokenized_test, maxlen=maxlen)
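
The padding step above can be illustrated without Keras. The following is a minimal sketch of what `pad_sequences` does with its default settings (zeros added at the front, overly long sequences cut at the front). The helper name `pad_pre` is hypothetical and not part of Keras.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad or left-truncate one sequence of token ids to exactly maxlen items.

    Hypothetical helper that imitates the default behaviour of pad_sequences
    (padding='pre', truncating='pre') for a single sequence, assuming maxlen >= 1.
    """
    # keep only the last maxlen tokens ('pre' truncation drops the beginning)
    seq = list(seq)[-maxlen:]
    # fill the missing positions at the front with the padding value
    return [value] * (maxlen - len(seq)) + seq


short_comment = [5, 8, 2]
long_comment = [1, 2, 3, 4, 5, 6]
print(pad_pre(short_comment, maxlen=4))  # [0, 5, 8, 2]
print(pad_pre(long_comment, maxlen=4))   # [3, 4, 5, 6]
```

With `maxlen=200`, as used above, every comment therefore becomes a row of exactly 200 token ids, and only the last 200 tokens of longer comments are kept.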

## 4. Pre-Trained Word Embeddings
Embeddings are numerical representations of tokens (e.g. words, n-grams) that encode information about their meaning and context. Intuitively, words that usually appear in similar contexts are assigned similar encodings.
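
"Similar encodings" is usually measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors (real embeddings in this notebook have 25 or 300 dimensions) purely to illustrate the idea; the words and values are assumptions, not taken from any pre-trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# made-up toy vectors, for illustration only
toy_vectors = {
    "awful":    [0.9, 0.1, 0.0],
    "terrible": [0.8, 0.2, 0.1],
    "banana":   [0.0, 0.1, 0.9],
}

sim_close = cosine_similarity(toy_vectors["awful"], toy_vectors["terrible"])
sim_far = cosine_similarity(toy_vectors["awful"], toy_vectors["banana"])
print(f"awful vs terrible: {sim_close:.2f}")
print(f"awful vs banana:   {sim_far:.2f}")
```

The first pair points in almost the same direction and scores close to 1, while the second pair is nearly orthogonal and scores close to 0.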

In this notebook we use the following word embeddings:

### **1. Word2Vec**

**Word2Vec** embeddings trained on Google News data. Note that the `negative300` part of the file name refers to training with negative sampling and to the 300-dimensional vectors, not to negative news.

Word2Vec trains a model on the context of each word, so similar words will have similar numerical representations. Each token is fed to the NN through an embedding layer initialized with random weights. The algorithm minimizes the loss of predicting the target word given its context words.
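
To make "predicting the target word given the context words" concrete, here is a minimal sketch of how such training pairs could be generated from a tokenized sentence. The function name `cbow_pairs` and the window size are assumptions for illustration; the sketch only builds the pairs and does not train a network.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs for a context-to-target objective."""
    pairs = []
    for i, target in enumerate(tokens):
        # up to `window` words on each side of the target form its context
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs


sentence = "you are a wonderful person".split()
for context, target in cbow_pairs(sentence, window=1):
    print(context, "->", target)
```

Each pair is one training example: the model sees the context words and is penalized when it fails to predict the target.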


### **2. GloVe**

**GloVe** embeddings trained on Twitter data. Because they were trained on short social media messages, they may be semantically close to the Wikipedia comments data. GloVe is similar in spirit to Word2Vec.

GloVe learns by constructing a co-occurrence frequency matrix of size words × contexts. Since this matrix is very large and high-dimensional, it is factorized to obtain a lower-dimensional representation.
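
The counting step can be sketched in a few lines. This is a simplified illustration with unweighted counts and a hypothetical helper `cooccurrence_counts`; it covers only the construction of the co-occurrence table, not the factorization that produces the final vectors.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count how often each (word, context_word) pair appears within `window` positions."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


# tiny made-up corpus, for illustration only
corpus = [["you", "are", "great"], ["you", "are", "rude"]]
counts = cooccurrence_counts(corpus, window=1)
print(counts[("you", "are")])   # 2
print(counts[("are", "rude")])  # 1
```

On a real corpus this table has one row per vocabulary word and one column per context word, which is why a lower-dimensional factorization is needed.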

### **3. FastText**

FastText is quite different from the two embeddings above. While Word2Vec and GloVe treat each word as the smallest unit to train on, FastText uses character n-grams as the smallest unit. This means it can generate better word embeddings for rare or even new, unseen words.
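
As an illustration of the character n-gram idea, the sketch below splits a word into overlapping character n-grams after wrapping it in boundary markers. The helper `char_ngrams` and its default range of 3 to 6 characters are assumptions for this example; the sketch only shows the decomposition, not how the n-gram vectors are trained or combined.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return all character n-grams of the word, including '<' and '>' boundary markers."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(marked) - n + 1):
            grams.append(marked[start:start + n])
    return grams


print(char_ngrams("toxic", n_min=3, n_max=3))
# ['<to', 'tox', 'oxi', 'xic', 'ic>']
```

An unseen word such as a misspelling still shares most of its n-grams with known words, which is why a vector can be composed for it.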

The following function returns an embedding matrix with loaded weights from the selected pre-trained word embedding:

In [None]:
def get_coefs(word,*arr):
    return word, np.asarray(arr, dtype='float32')

def get_embedding_size(embedding):
    # vector dimension of each supported pre-trained embedding
    embedding_size = {'glove': 25, 'word2vec': 300, 'fasttext': 300}
    if embedding not in embedding_size:
        raise ValueError(f'Embedding type {embedding} is not supported')
    return embedding_size[embedding]
    
def build_matrix(embedding, max_features):
    # define sources and embedding size for each supported word embedding
    if(embedding=="glove"):
        embedding_idx = dict(get_coefs(*o.strip().split(" ")) for o in open('../input/glove-twitter/glove.twitter.27B.25d.txt'))
        embed_size = 25
    elif(embedding=="word2vec"):
        word2vec_dict = word2vec.KeyedVectors.load_word2vec_format("../input/googlenewsvectorsnegative300/GoogleNews-vectors-negative300.bin", binary=True)
        embed_size = 300
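        embedding_idx = {}
        # gensim >= 4 exposes the vocabulary as `key_to_index`; older versions use `vocab`
        vocab = getattr(word2vec_dict, 'key_to_index', None) or word2vec_dict.vocab
        for word in vocab:
            embedding_idx[word] = word2vec_dict[word]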
        # clean-up embedding dict
        del word2vec_dict
        gc.collect()
    elif(embedding=="fasttext"):
        embedding_idx = dict(get_coefs(*o.strip().split(" ")) for o in open('../input/fasttext/wiki.simple.vec'))
        embed_size = 300
    else:
        raise ValueError(f'Embedding type {embedding} is not supported')

    # limit vocabulary size
    nb_words = min(max_features, len(tk.word_index))
    # create embedding matrix template filled with zeroes
    embedding_matrix = np.zeros((nb_words + 1, embed_size))
    # save all word embeddings common to training data and pre-trained corpus
    for word, i in tk.word_index.items():
        if i >= max_features: break
        # try to obtain embedding
        tmp = embedding_idx.get(word)
        # if word exists in pre-trained embeddings, add embedding to feature matrix
        if tmp is not None:
            embedding_matrix[i] = tmp
    return embedding_matrix
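
The lookup logic of `build_matrix` can be tried on a toy example without downloading any embedding file. The sketch below is a simplified stand-in, not the function above: it uses plain lists instead of NumPy, an in-memory file, and two hypothetical helpers (`load_vec`, `fill_matrix`). It also skips the two-number header line (vocabulary size and dimension) that `.vec` text files typically start with.

```python
import io


def load_vec(handle):
    """Parse a text embedding file into {word: [floats]}, skipping an optional header."""
    vectors = {}
    for line_no, line in enumerate(handle):
        parts = line.rstrip().split(" ")
        # a header line holds only two fields: vocabulary size and dimension
        if line_no == 0 and len(parts) == 2:
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def fill_matrix(word_index, vectors, max_features, embed_size):
    """Build a (nb_words + 1) x embed_size matrix; rows of unknown words stay at zero."""
    nb_words = min(max_features, len(word_index))
    matrix = [[0.0] * embed_size for _ in range(nb_words + 1)]
    for word, i in word_index.items():
        if i < max_features and word in vectors:
            matrix[i] = vectors[word]
    return matrix


# toy data, for illustration only
toy_file = io.StringIO("2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")
toy_index = {"hello": 1, "unseen": 2, "world": 3}
matrix = fill_matrix(toy_index, load_vec(toy_file), max_features=20000, embed_size=3)
for row in matrix:
    print(row)
```

Row 0 is reserved for padding and row 2 belongs to a word missing from the pre-trained vocabulary, so both remain zero vectors.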

## 5. Convolutional Neural Network Model
First, we'll define a simple, parametrized model. It can be defined by specifying:
- Number of convolutional filters
- Size of each convolutional filter
- Embedding size
- Embedding matrix

In [None]:
def get_cnn_model(num_filters, filter_sizes, embed_size, embedding_matrix):    
    in_layer = Input(shape=(maxlen, ))
    x = Embedding(max_features, embed_size, weights=[embedding_matrix])(in_layer)
    x = SpatialDropout1D(0.4)(x)
    x = Reshape((maxlen, embed_size, 1))(x)
    
    conv, maxpool = [], []
    for i in range(len(filter_sizes)):
        conv.append(Conv2D(num_filters, kernel_size=(filter_sizes[i], embed_size), kernel_initializer='normal', activation='elu')(x))
        maxpool.append(MaxPool2D(pool_size=(maxlen - filter_sizes[i] + 1, 1))(conv[i]))
     
    z = Concatenate(axis=1)(maxpool)   
    z = Flatten()(z)
    z = Dropout(0.1)(z)
    out_layer = Dense(6, activation="sigmoid")(z)
    
    model = Model(inputs=in_layer, outputs=out_layer)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    return model
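
To see why the pooling window in `get_cnn_model` is `maxlen - filter_size + 1`, it helps to follow the tensor shapes. The sketch below only does the shape arithmetic, assuming unpadded ("valid") convolutions and channels-last tensors; the helper `cnn_feature_shapes` is hypothetical and does not build a Keras model.

```python
def cnn_feature_shapes(maxlen, num_filters, filter_sizes):
    """Return per-filter-size (conv_shape, pool_shape) and the flattened feature count."""
    shapes = {}
    for size in filter_sizes:
        # the kernel spans the full embedding width, so that axis collapses to 1;
        # along the sequence axis an unpadded convolution leaves maxlen - size + 1 positions
        conv_shape = (maxlen - size + 1, 1, num_filters)
        # max-pooling over all remaining positions keeps one value per filter
        pool_shape = (1, 1, num_filters)
        shapes[size] = (conv_shape, pool_shape)
    # after concatenation and flattening: one value per filter and filter size
    flat_features = len(filter_sizes) * num_filters
    return shapes, flat_features


shapes, flat_features = cnn_feature_shapes(maxlen=200, num_filters=32, filter_sizes=[3, 5])
print(shapes[3][0])   # (198, 1, 32)
print(shapes[5][0])   # (196, 1, 32)
print(flat_features)  # 64
```

With 32 filters and filter sizes 3 and 5, the final dense layer therefore receives 64 features per comment, regardless of the embedding dimension.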

### 5.1 Word2Vec Embeddings
Let's start by loading the pre-trained `Word2Vec` embeddings:

In [None]:
embedding_matrix = build_matrix('word2vec', max_features = 20000)
embedding_matrix.shape

In [None]:
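# the vocabulary size must match the number of rows in the embedding matrix
max_features = embedding_matrix.shape[0]
# sequence length, kept independent of the embedding dimension
maxlen = 200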

# tokenize training and test data
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(list_sentences_train))
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
list_tokenized_test = tokenizer.texts_to_sequences(list_sentences_test)

# pad sequences for homogeneous length
X_train = pad_sequences(list_tokenized_train, maxlen=maxlen)
X_test = pad_sequences(list_tokenized_test, maxlen=maxlen)

In [None]:
batch_size = 256
epochs = 3
model = get_cnn_model(num_filters = 32, filter_sizes = [3,5], embed_size = get_embedding_size('word2vec'), embedding_matrix = embedding_matrix)
X_tra, X_val, y_tra, y_val = train_test_split(X_train, y[labels], train_size=0.95, random_state=233)
hist = model.fit(X_tra, y_tra, batch_size=batch_size, epochs=epochs, validation_data=(X_val, y_val))

In [None]:
y_pred = model.predict(X_val, batch_size=1024)
results = model.evaluate(X_val, y_val)
print('Validation loss, validation acc:', results)

In [None]:
# save the model
model.save('cnn_word2vec.h5')
# reload_model = keras.models.load_model('cnn_word2vec.h5')

### 5.2 GloVe Embeddings

We can repeat the same exercise using a different embedding. Let's take `GloVe`:

In [None]:
embedding_matrix = build_matrix('glove', max_features = 20000)
embedding_matrix.shape

In [None]:
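# the vocabulary size must match the number of rows in the embedding matrix
max_features = embedding_matrix.shape[0]
# sequence length, kept independent of the embedding dimension
maxlen = 200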

# tokenize training and test data
tk = Tokenizer(num_words=max_features)
tk.fit_on_texts(list(list_sentences_train))
list_tokenized_train = tk.texts_to_sequences(list_sentences_train)
list_tokenized_test = tk.texts_to_sequences(list_sentences_test)

# pad sequences for homogeneous length
X_train = pad_sequences(list_tokenized_train, maxlen=maxlen)
X_test = pad_sequences(list_tokenized_test, maxlen=maxlen)

In [None]:
batch_size = 256
epochs = 3
model = get_cnn_model(num_filters = 32, filter_sizes = [3,5], embed_size = get_embedding_size('glove'), embedding_matrix = embedding_matrix)
X_tra, X_val, y_tra, y_val = train_test_split(X_train, y[labels], train_size=0.95, random_state=233)
hist = model.fit(X_tra, y_tra, batch_size=batch_size, epochs=epochs, validation_data=(X_val, y_val)) # verbose = 2

In [None]:
y_pred = model.predict(X_val, batch_size=1024)
results = model.evaluate(X_val, y_val)
print('Validation loss, validation acc:', results)

In [None]:
# save the model
model.save('cnn_glove.h5')
# reload_model = keras.models.load_model('cnn_glove.h5')

### 5.3 FastText Embeddings

In [None]:
embedding_matrix = build_matrix('fasttext', max_features = 20000)
embedding_matrix.shape

In [None]:
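# the vocabulary size must match the number of rows in the embedding matrix
max_features = embedding_matrix.shape[0]
# sequence length, kept independent of the embedding dimension
maxlen = 200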

# tokenize training and test data
tk = Tokenizer(num_words=max_features)
tk.fit_on_texts(list(list_sentences_train))
list_tokenized_train = tk.texts_to_sequences(list_sentences_train)
list_tokenized_test = tk.texts_to_sequences(list_sentences_test)

# pad sequences for homogeneous length
X_train = pad_sequences(list_tokenized_train, maxlen=maxlen)
X_test = pad_sequences(list_tokenized_test, maxlen=maxlen)

In [None]:
batch_size = 256
epochs = 3
model = get_cnn_model(num_filters = 32, filter_sizes = [3,5], embed_size = get_embedding_size('fasttext'), embedding_matrix = embedding_matrix)
X_tra, X_val, y_tra, y_val = train_test_split(X_train, y[labels], train_size=0.95, random_state=233)
hist = model.fit(X_tra, y_tra, batch_size=batch_size, epochs=epochs, validation_data=(X_val, y_val)) # verbose = 2

In [None]:
y_pred = model.predict(X_val, batch_size=1024)
results = model.evaluate(X_val, y_val)
print('Validation loss, validation acc:', results)

In [None]:
# save the model
model.save('cnn_fasttext.h5')
# reload_model = keras.models.load_model('cnn_fasttext.h5')

## 6. Conclusions
- Selecting a sensible value for `max_features` reduced training time from around 20 minutes per epoch to less than a minute, without affecting accuracy.
- All three pre-trained embeddings provide good results, which suggests that the corpora they were trained on contain enough information to characterize and effectively differentiate the different types of toxic behaviour.