# Overview

This notebook walks step by step through the following kernel and adds explanatory notes:
* Basic LSTM: https://www.kaggle.com/thousandvoices/simple-lstm

It also draws on this EDA for a better understanding of the data: https://www.kaggle.com/ekhtiar/unintended-eda-with-tutorial-notes

# Model Architecture Overview
![Image](https://i.ibb.co/XYjvJJ8/Screen-Hunter-3119.jpg)
![Image](https://i.ibb.co/RpNyPWb/Screen-Hunter-3118.jpg)

# Imports

## Pre-work: Get your pre-trained NLP model for embedding first
Click on "+ Add Dataset" on the top right corner, and add these two datasets:
* https://www.kaggle.com/takuok/glove840b300dtxt
* https://www.kaggle.com/yekenot/fasttext-crawl-300d-2m




In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from keras.models import Model
from keras.layers import Input, Dense, Embedding, SpatialDropout1D, add, concatenate
from keras.layers import CuDNNLSTM, Bidirectional, GlobalMaxPooling1D, GlobalAveragePooling1D
from keras.preprocessing import text, sequence
from keras.callbacks import LearningRateScheduler

import matplotlib.pyplot as plt
import seaborn as sns
import plotly_express as px
# import plotly.plotly as py
import plotly.offline as pyo
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

import os
print(os.listdir("../input"))

# Pre-work: Variables

In [None]:
NUM_MODELS = 2
BATCH_SIZE = 512
LSTM_UNITS = 128
DENSE_HIDDEN_UNITS = 4 * LSTM_UNITS
MAX_LEN = 220

# Pre-work: Basic EDA

In [None]:
print(os.listdir("../input"))
print(os.listdir("../input/jigsaw-unintended-bias-in-toxicity-classification"))

In [None]:
train_df = pd.read_csv('../input/jigsaw-unintended-bias-in-toxicity-classification/train.csv')
test_df = pd.read_csv('../input/jigsaw-unintended-bias-in-toxicity-classification/test.csv')
sample_df = pd.read_csv('../input/jigsaw-unintended-bias-in-toxicity-classification/sample_submission.csv')

In [None]:
train_df.head(2)


In [None]:
test_df.head(2)

In [None]:
sample_df.head(2)

# Pre-work - Working with label columns

In [None]:
IDENTITY_COLUMNS = [
    'male', 'female', 'homosexual_gay_or_lesbian', 'christian', 'jewish',
    'muslim', 'black', 'white', 'psychiatric_or_mental_illness'
]
AUX_COLUMNS = ['target', 'severe_toxicity', 'obscene', 'identity_attack', 'insult', 'threat']
TEXT_COLUMN = 'comment_text'
TARGET_COLUMN = 'target'

In [None]:
x_train = train_df[TEXT_COLUMN].astype(str)
y_train = train_df[TARGET_COLUMN].values
y_aux_train = train_df[AUX_COLUMNS].values
x_test = test_df[TEXT_COLUMN].astype(str)

for column in IDENTITY_COLUMNS + [TARGET_COLUMN]:
    train_df[column] = np.where(train_df[column] >= 0.5, True, False)

In [None]:
x_train.head()

In [None]:
y_train[:5]

In [None]:
train_df.head(5)

## Text Pre-processing

First, we fit the tokenizer on both the train and test text. The tokenizer is fitted on a list of texts, so we convert each pandas Series to a list first and concatenate the two.

In [None]:
from keras.preprocessing import text
CHARS_TO_REMOVE = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n“”’\'∞θ÷α•à−β∅³π‘₹´°£€\×™√²—'

tokenizer = text.Tokenizer(filters=CHARS_TO_REMOVE)
tokenizer.fit_on_texts(list(x_train) + list(x_test))

Then we apply the tokenizer to transform x_train and x_test into sequences of token indices

In [None]:
x_train = tokenizer.texts_to_sequences(x_train)
x_test = tokenizer.texts_to_sequences(x_test)

Let's check out what it looks like

In [None]:
print(x_train[0])
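
To make the word-to-index mapping concrete, here is a minimal pure-Python sketch of a frequency-ranked word index. It is an illustration only, not the Keras implementation: `build_word_index` and `to_sequences` are hypothetical helpers, the toy corpus is made up, and ties are broken alphabetically purely to keep the result deterministic.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent word (0 is left for padding)."""
    counts = Counter(word for t in texts for word in t.lower().split())
    # Descending count; alphabetical tie-break is an assumption made for this sketch.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i + 1 for i, word in enumerate(ranked)}

def to_sequences(texts, word_index):
    """Replace every known word with its integer index."""
    return [[word_index[w] for w in t.lower().split() if w in word_index] for t in texts]

toy_texts = ["the cat sat", "the dog sat down", "the end"]
toy_index = build_word_index(toy_texts)
toy_sequences = to_sequences(toy_texts, toy_index)
print(toy_index)
print(toy_sequences)
```

The most frequent word gets the smallest index, which is why low numbers dominate the printed sequence above.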

Then we pad the sequences so that every row has the same number of tokens (MAX_LEN). We'll also print one example

In [None]:
from keras.preprocessing import sequence
x_train = sequence.pad_sequences(x_train, maxlen=MAX_LEN)
x_test = sequence.pad_sequences(x_test, maxlen=MAX_LEN)
print(x_train[0])

As you can see, this example has relatively few words compared to MAX_LEN (220 tokens), so a lot of zero padding is added at the front.
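
The padding behaviour can be sketched in a few lines of plain Python. This assumes the `pad_sequences` defaults (pad at the front, truncate from the front); `pad_pre` is a hypothetical helper written for illustration and is not part of Keras.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` tokens if the sequence is too long."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

short_example = pad_pre([5, 8, 2], maxlen=6)
long_example = pad_pre([1, 2, 3, 4, 5, 6, 7, 8], maxlen=6)
print(short_example)  # zeros are added at the front
print(long_example)   # the earliest tokens are dropped
```

With these defaults, a comment longer than MAX_LEN keeps its ending rather than its beginning.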

# Weightings to account for the bias metrics

### Explanation
The weighting steps below account for the bias AUC metrics used in the evaluation. To recap, there are three of them:

**Subgroup AUC:** Here, we restrict the data set to only the examples that mention the specific identity subgroup. A low value in this metric means the model does a poor job of distinguishing between toxic and non-toxic comments that mention the identity.

**BPSN (Background Positive, Subgroup Negative) AUC:** Here, we restrict the test set to the non-toxic examples that mention the identity and the toxic examples that do not. A low value in this metric means that the model confuses non-toxic examples that mention the identity with toxic examples that do not, likely meaning that the model predicts higher toxicity scores than it should for non-toxic examples mentioning the identity.

**BNSP (Background Negative, Subgroup Positive) AUC:** Here, we restrict the test set to the toxic examples that mention the identity and the non-toxic examples that do not. A low value here means that the model confuses toxic examples that mention the identity with non-toxic examples that do not, likely meaning that the model predicts lower toxicity scores than it should for toxic examples mentioning the identity.
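
The three definitions above can be sketched for a single identity subgroup as follows. This is a toy illustration under stated assumptions, not the competition's official scoring code: `auc` and `bias_aucs` are hypothetical helpers, the AUC is computed by brute-force pair comparison, and the labels and scores are made up.

```python
import numpy as np

def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

def bias_aucs(toxic, in_subgroup, scores):
    toxic = np.asarray(toxic, dtype=bool)
    sub = np.asarray(in_subgroup, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    bpsn = (sub & ~toxic) | (~sub & toxic)    # subgroup negatives + background positives
    bnsp = (sub & toxic) | (~sub & ~toxic)    # subgroup positives + background negatives
    return {
        "subgroup_auc": auc(toxic[sub], scores[sub]),
        "bpsn_auc": auc(toxic[bpsn], scores[bpsn]),
        "bnsp_auc": auc(toxic[bnsp], scores[bnsp]),
    }

toy_toxic    = [1, 0, 0, 1, 0, 1]
toy_subgroup = [1, 1, 1, 0, 0, 0]
toy_scores   = [0.9, 0.8, 0.2, 0.7, 0.1, 0.6]
toy_result = bias_aucs(toy_toxic, toy_subgroup, toy_scores)
print(toy_result)
```

In this toy data the second comment is non-toxic, mentions the identity, and still scores 0.8, higher than both toxic background comments. That single over-scored example is what pulls the BPSN AUC down while the other two metrics stay perfect.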

Initialize weighting

In [None]:
sample_weights = np.ones(len(x_train), dtype=np.float32)

Let's start with the weighting on identity_columns and show the histogram

In [None]:
sample_weights += train_df[IDENTITY_COLUMNS].sum(axis=1)

"""
For reminder on identity columns:
IDENTITY_COLUMNS = [
    'male', 'female', 'homosexual_gay_or_lesbian', 'christian', 'jewish',
    'muslim', 'black', 'white', 'psychiatric_or_mental_illness'
]
"""

In [None]:
data = [go.Histogram(x=sample_weights)]
layout = {'title': 'Distribution of weights after adding identity_columns'}
iplot({'data':data, 'layout':layout})

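Two further adjustments are applied:
* For toxic comments, add 1 for every identity column that is *not* mentioned
* For non-toxic comments, add 5 for every identity column that *is* mentioned (an innocent mention of an identity)

Finally, the weights are divided by their mean so that the average weight is 1.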

In [None]:
sample_weights += train_df[TARGET_COLUMN] * (~train_df[IDENTITY_COLUMNS]).sum(axis=1)
sample_weights += (~train_df[TARGET_COLUMN]) * train_df[IDENTITY_COLUMNS].sum(axis=1) * 5
sample_weights /= sample_weights.mean()
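
To see what these three additions do, here is a small NumPy sketch that applies the same arithmetic to four made-up comments with three identity columns (the real data has nine). The arrays are hypothetical; only the formulas mirror the cells above.

```python
import numpy as np

# Toy data: 4 comments, 3 identity columns (already thresholded to booleans).
identities = np.array([
    [False, False, False],  # toxic, no identity mentioned
    [True,  False, False],  # toxic, one identity mentioned
    [True,  True,  False],  # non-toxic, two identities mentioned
    [False, False, False],  # non-toxic, no identity mentioned
])
toxic = np.array([True, True, False, False])

weights = np.ones(len(toxic), dtype=np.float32)
weights += identities.sum(axis=1)                  # +1 per identity mentioned
weights += toxic * (~identities).sum(axis=1)       # toxic: +1 per identity NOT mentioned
weights += (~toxic) * identities.sum(axis=1) * 5   # non-toxic: +5 per identity mentioned
raw_weights = weights.copy()
weights /= weights.mean()                          # rescale so the mean weight is 1

print(raw_weights)
print(weights)
```

Note that both toxic rows end up with the same raw weight: the first and second additions together always contribute the total number of identity columns. Non-toxic comments that mention identities receive the largest weights, which targets the BPSN case described above.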

# Embedding Matrix

First, we load the embeddings from the pre-trained model. The output is an embedding index, which maps each word to its embedding vector.

Afterwards, we take the tokenizer's word index built from our corpus and, for each word and its index, look up the embedding vector and store it in the matching row of the embedding matrix.

The steps here are similar to those outlined at https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/

## First, create 3 custom functions

In [None]:
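def get_coefs(word, *arr):
    # First field is the word, the remaining fields are its vector components
    return word, np.asarray(arr, dtype='float32')

def load_embeddings(path):
    # Build a {word: vector} dictionary from a space-separated text file
    with open(path) as f:
        return dict(get_coefs(*line.strip().split(' ')) for line in f)

def build_matrix(word_index, path):
    embedding_index = load_embeddings(path)
    # Row 0 stays all-zero: the tokenizer's indices start at 1 and 0 is the padding token
    embedding_matrix = np.zeros((len(word_index) + 1, 300))
    for word, i in word_index.items():
        try:
            embedding_matrix[i] = embedding_index[word]
        except KeyError:
            # Word not in the pre-trained vocabulary: leave its row as zeros
            pass
    return embedding_matrix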

Let's see how the load_embeddings function works:

In [None]:
path = '../input/fasttext-crawl-300d-2m/crawl-300d-2M.vec'
with open(path) as f:
    i = 0
    for line in f:
        output1 = line.strip().split(' ')
        print(output1)
        print(type(output1))
        print('=====')
        i += 1
        if i == 5:
            break              

Except for the first line, each line splits into a list of 301 elements: the first element is the word and the remaining 300 are its vector. (The first line of the `.vec` file is a header giving the number of words and the vector dimension.) The function above therefore builds a dictionary that maps each word to its vector.

The get_coefs function takes those 301 elements as input and returns a tuple of (word, 300-D array); load_embeddings then turns these tuples into a dictionary
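
The parsing step can be tried on a tiny in-memory file. The sketch below reuses get_coefs and the same dictionary expression as load_embeddings; the three-dimensional "file" and its words are made up for illustration.

```python
import io
import numpy as np

def get_coefs(word, *arr):
    return word, np.asarray(arr, dtype='float32')

# Hypothetical 3-dimensional file in the .vec layout: a "count dim" header, then word + vector.
toy_file = io.StringIO("2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")
toy_index = dict(get_coefs(*line.strip().split(' ')) for line in toy_file)

print(toy_index['hello'])
print(toy_index['2'])  # the header line is parsed as well: key '2' with a one-element vector
```

One thing worth knowing: because every line is parsed the same way, the header line also becomes a dictionary entry, keyed by the word count and holding a one-element vector. It only matters if the corpus happens to contain that exact number as a token.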

We'll create embedding index from each file separately to have better visibility

In [None]:
EMBEDDING_FILES = [
    '../input/fasttext-crawl-300d-2m/crawl-300d-2M.vec',
    '../input/glove840b300dtxt/glove.840B.300d.txt'
]
embedding_file1 = '../input/fasttext-crawl-300d-2m/crawl-300d-2M.vec'
embedding_file2 = '../input/glove840b300dtxt/glove.840B.300d.txt'

### From Embedding Index + Word Index --> Embedding Matrix

Manually create the embedding matrix

In [None]:
# embedding_index = load_embeddings(path)
# embedding_matrix = np.zeros((len(tokenizer.word_index) + 1, 300))
# for word, i in tokenizer.word_index.items():
#     try:
#         embedding_matrix[i] = embedding_index[word]
#     except KeyError:
#         pass  

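Now, print the first two rows of the embedding matrix. Row 0 is the all-zero padding row; row 1 belongs to the most frequent word in the tokenizer's word index. The cell above must be uncommented first, otherwise embedding_matrix is not yet defined.

In [None]:
print(embedding_matrix[:2])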

### Now that we've done Embedding Matrix building manually, let's just rerun the provided function

In [None]:
embedding_matrix = np.concatenate(
    [build_matrix(tokenizer.word_index, f) for f in EMBEDDING_FILES], axis=-1)
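
The effect of the concatenation is easier to see at toy scale. In this sketch, `build_toy_matrix` is a hypothetical stand-in for build_matrix that takes an already-loaded dictionary, and both "embedding sources" are made-up 2-D vectors instead of the real 300-D ones.

```python
import numpy as np

def build_toy_matrix(word_index, embedding_index, dim):
    """Same idea as build_matrix, but with an in-memory dictionary and a configurable dimension."""
    matrix = np.zeros((len(word_index) + 1, dim))
    for word, i in word_index.items():
        if word in embedding_index:
            matrix[i] = embedding_index[word]
    return matrix

toy_word_index = {'the': 1, 'cat': 2, 'zzz': 3}
toy_source_a = {'the': np.array([1.0, 1.0]), 'cat': np.array([2.0, 2.0])}
toy_source_b = {'the': np.array([9.0, 9.0])}

toy_matrix = np.concatenate(
    [build_toy_matrix(toy_word_index, src, dim=2) for src in (toy_source_a, toy_source_b)],
    axis=-1)
print(toy_matrix.shape)
print(toy_matrix)
```

Each word's row holds its vector from the first source followed by its vector from the second, so with two 300-D files the real matrix has 600 columns. A word missing from one source simply has zeros in that half of its row.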

# Now, build the model

In [None]:
"""
Reminder: y_aux_train = train_df[AUX_COLUMNS].values
"""
num_aux_targets = y_aux_train.shape[-1]
print(num_aux_targets)

In [None]:

LSTM_UNITS = 128
DENSE_HIDDEN_UNITS = 4 * LSTM_UNITS
EPOCHS = 4

#### Explanation: Multiple LSTM Units
Here is a good explanation of having multiple LSTM units: https://stackoverflow.com/questions/44273249/in-keras-what-exactly-am-i-configuring-when-i-create-a-stateful-lstm-layer-wi

![Image](https://i.stack.imgur.com/xLZCK.png)

![Image2](https://i.stack.imgur.com/PQs02.png)



Here is another good guide: https://medium.com/jatana/report-on-text-classification-using-cnn-rnn-han-f0e887214d5f

### Spatial Dropout

Good explanation taken from stackoverflow https://stackoverflow.com/questions/50393666/how-to-understand-spatialdropout1d-and-when-to-use-it

>  In order to understand SpatialDropout1D, you should get used to the notion of the noise shape. In plain vanilla dropout, each element is kept or dropped independently. For example, if the tensor is [2, 2, 2], each of 8 elements can be zeroed out depending on random coin flip (with certain "heads" probability); in total, there will be 8 independent coin flips and any number of values may become zero, from 0 to 8.

> Sometimes there is a need to do more than that. For example, one may need to drop the whole slice along 0 axis. The noise_shape in this case is [1, 2, 2] and the dropout involves only 4 independent random coin flips. The first component will either be kept together or be dropped together. The number of zeroed elements can be 0, 2, 4, 6 or 8. It cannot be 1 or 5.

> You may want to do this to account for adjacent pixels correlation, especially in the early convolutional layers. Effectively, you want to prevent co-adaptation of pixels with its neighbors across the feature maps, and make them learn as if no other feature maps exist.

Another simpler one from https://datascience.stackexchange.com/questions/38519/what-does-spatialdropout1d-do-to-output-of-embedding-in-keras:

> Basically, it removes all the pixel in a row from all channels. eg: take [[1,1,1], [2,4,5]], there are 3 points with values in 2 channels, by doing SpatialDropout1D it zeros an entire row ie all attributes of a point is set to 0; like [[1,1,0], [2,4,0]]

> The intuition behind this is in many cases for an image the adjacent pixels are correlated, so hiding one of them is not helping much, rather hiding entire row, that's gotta make a difference
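
For the embedding output used in this notebook, the input has shape (batch, timesteps, channels), and the noise-shape idea from the first quote means one coin flip per sample and channel, shared across all timesteps. The NumPy sketch below illustrates that idea; `spatial_dropout_1d` is a hypothetical helper, not the Keras layer itself, and the rescaling by 1 / (1 - rate) is the usual inverted-dropout convention.

```python
import numpy as np

def spatial_dropout_1d(x, rate, rng):
    """x has shape (batch, timesteps, channels); whole channels are dropped per sample."""
    batch, _, channels = x.shape
    keep = rng.random((batch, 1, channels)) >= rate   # one coin flip per (sample, channel)
    return x * keep / (1.0 - rate)                    # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
toy_x = np.ones((1, 5, 8))   # 1 comment, 5 tokens, 8 embedding dimensions
toy_out = spatial_dropout_1d(toy_x, rate=0.5, rng=rng)
print(toy_out[0])
```

In the printed matrix every column is either entirely zero or entirely kept: a dropped embedding dimension disappears for every token of the comment at once, rather than for scattered individual values.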


If you look for the weights parameter in the keras.layers.Embedding documentation, you will not find it there. That is because weights and trainable are inherited from the Layer base class.

In [None]:
def build_model(embedding_matrix, num_aux_targets):
    words = Input(shape=(None,))
    x = Embedding(*embedding_matrix.shape, weights=[embedding_matrix], trainable=False)(words)
    x = SpatialDropout1D(0.2)(x)
    x = Bidirectional(CuDNNLSTM(LSTM_UNITS, return_sequences=True))(x)
    x = Bidirectional(CuDNNLSTM(LSTM_UNITS, return_sequences=True))(x)
    hidden = concatenate([
        GlobalMaxPooling1D()(x),
        GlobalAveragePooling1D()(x),
    ])
    hidden = add([hidden, Dense(DENSE_HIDDEN_UNITS, activation='relu')(hidden)])
    hidden = add([hidden, Dense(DENSE_HIDDEN_UNITS, activation='relu')(hidden)])
    result = Dense(1, activation='sigmoid')(hidden)
    aux_result = Dense(num_aux_targets, activation='sigmoid')(hidden)
    
    model = Model(inputs=words, outputs=[result, aux_result])
    model.compile(loss='binary_crossentropy', optimizer='adam')

    return model
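
One detail in build_model that is easy to miss is why DENSE_HIDDEN_UNITS is 4 * LSTM_UNITS. The NumPy sketch below traces the shapes only; the random arrays are stand-ins for the LSTM output and the Dense weights, not anything the real model produces.

```python
import numpy as np

LSTM_UNITS = 128
DENSE_HIDDEN_UNITS = 4 * LSTM_UNITS

rng = np.random.default_rng(0)
# Stand-in for the second Bidirectional LSTM output: (batch, timesteps, 2 * LSTM_UNITS).
toy_lstm_out = rng.normal(size=(2, 220, 2 * LSTM_UNITS))

max_pool = toy_lstm_out.max(axis=1)    # like GlobalMaxPooling1D     -> (2, 256)
avg_pool = toy_lstm_out.mean(axis=1)   # like GlobalAveragePooling1D -> (2, 256)
hidden = np.concatenate([max_pool, avg_pool], axis=-1)   # -> (2, 512)

# Stand-in for Dense(DENSE_HIDDEN_UNITS, activation='relu') with random weights.
dense_w = rng.normal(size=(hidden.shape[-1], DENSE_HIDDEN_UNITS))
dense_out = np.maximum(hidden @ dense_w, 0.0)
residual = hidden + dense_out          # add([...]) needs both inputs to have the same shape
print(hidden.shape, dense_out.shape, residual.shape)
```

A bidirectional layer doubles the width (forward plus backward), and concatenating max and average pooling doubles it again, giving 4 * LSTM_UNITS = 512. The Dense layers use the same width so that the skip connections built with add can sum their input and output element-wise.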

In [None]:
model = build_model(embedding_matrix, y_aux_train.shape[-1])

In [None]:
model.summary()

Overview of the model summary:
> ![Image](https://i.ibb.co/RpNyPWb/Screen-Hunter-3118.jpg)

In [None]:
# checkpoint_predictions = []
# weights = []
# EPOCHS = 4
# NUM_MODELS = 2
# BATCH_SIZE = 512

# for model_idx in range(NUM_MODELS):  # model_idx is unused; the loop just trains NUM_MODELS separately initialised models whose checkpoint predictions are all averaged below
#     model = build_model(embedding_matrix, y_aux_train.shape[-1])
#     for global_epoch in range(EPOCHS):
#         model.fit(
#             x_train,
#             [y_train, y_aux_train],
#             batch_size=BATCH_SIZE,
#             epochs=1,
#             verbose=2,
#             sample_weight=[sample_weights.values, np.ones_like(sample_weights)],
#             callbacks=[
#                 LearningRateScheduler(lambda _: 1e-3 * (0.55 ** global_epoch))
#             ]
#         )
#         checkpoint_predictions.append(model.predict(x_test, batch_size=2048)[0].flatten())
#         weights.append(2 ** global_epoch)
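
The LearningRateScheduler lambda in the commented-out cell multiplies the learning rate by 0.55 after every epoch, starting from 1e-3. Calling fit with epochs=1 inside the loop is what makes it possible to predict after each epoch. The resulting schedule can be listed directly:

```python
EPOCHS = 4

# Same formula as the lambda passed to LearningRateScheduler above.
learning_rates = [1e-3 * (0.55 ** epoch) for epoch in range(EPOCHS)]
for epoch, lr in enumerate(learning_rates):
    print(f"epoch {epoch}: lr = {lr:.3e}")
```

By the fourth epoch the learning rate is roughly one sixth of its starting value.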

After each epoch, we predict on the test set with the current model and save the result. In the next section, the final prediction is a weighted average of these checkpoint predictions

# Now that the model is built, we'll run the predictions

Instead of using only the model from the final epoch, we take a weighted mean of the predictions from every epoch. Each checkpoint is weighted by 2^epoch (1, 2, 4, 8 for epochs 0 to 3), so the weight doubles with each epoch and later checkpoints count more
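
As a small worked example of this weighted mean, the sketch below averages four made-up checkpoint predictions for three test comments. The numbers are hypothetical; only the `np.average` call and the 2^epoch weights mirror the notebook.

```python
import numpy as np

# Hypothetical predictions for 3 test comments after each of 4 epochs (one model).
toy_checkpoints = [
    np.array([0.2, 0.6, 0.9]),
    np.array([0.4, 0.6, 0.9]),
    np.array([0.4, 0.6, 0.7]),
    np.array([0.5, 0.6, 0.8]),
]
toy_weights = [2 ** epoch for epoch in range(4)]   # [1, 2, 4, 8]

toy_prediction = np.average(toy_checkpoints, weights=toy_weights, axis=0)
print(toy_prediction)
```

The final epoch carries 8 of the 15 weight units, a little over half, so the blend stays close to the last checkpoint while still smoothing over the earlier ones. With NUM_MODELS = 2 the same pattern of weights is simply repeated for the second model.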

In [None]:
# predictions = np.average(checkpoint_predictions, weights=weights, axis=0)

# submission = pd.DataFrame.from_dict({
#     'id': test_df.id,
#     'prediction': predictions
# })
# submission.to_csv('submission.csv', index=False)

# Final Notes

This kernel is still a work in progress.
I commented out the training and prediction steps to cut down on running time, since the only updates after the previous commit are to the markdown