## Quora Insincere Questions Classification - Keras Embedding & BiLSTM

#### Sunil Kumar
        
https://www.kaggle.com/suniliitb96/qiqc-with-bilstm-attention
        
##### Solution workflows: - 

* Clean text with Gensim: strip tags, punctuation, repeated whitespace and digits, then remove stopwords and short words (< 3 characters)
* Prepare vocabulary & check it against the pre-trained word embeddings vocab for coverage
* Plot the histogram of question lengths in the test dataset to identify an appropriate sequence length (use a similar count for LSTM units)
* Fix max_features as per cleaned corpus vocab
* Prepare embedding matrix for our vocab
* Prepare input word vectors
* Define Bidirectional LSTM/GRU with Attention network
* Train & validate with training partitions through Keras Checkpointing callback which saves the model weights corresponding to the best val_accuracy
* Re-create the same untrained model, load the saved best model weights and then predict labels for the test data
* Prepare the submission csv

##### Attention Layer in the NLP Neural Network

This solution borrows attention from seq-2-seq models, but applies it in an intermediate layer of the network. Ultimately, the network is compressed through a pooling layer, a compacting Dense layer, etc. for the final goal of binary classification. Note that the encoder-decoder pattern itself is not suited to this problem. This is called Additive Attention - refer to http://ruder.io/deep-learning-nlp-best-practices/.

Attention is most often used in sequence-to-sequence models. Without an attention mechanism, your model has to capture the essence of the entire input sequence in a single hidden state. The attention mechanism simply gives the network access to its internal memory (in this case, the previous layer's output at every time step). The network retrieves a weighted combination of all memory locations, and it learns these attention weights too.
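
The weighted combination described above can be sketched in plain NumPy. This is only an illustration for a single question, assuming the same scoring used by the `Attention` layer defined later in this notebook (a `tanh` score per time step followed by a softmax); the function name `attention_pool` and the toy shapes are made up for the example.

```python
import numpy as np

def attention_pool(x, w, b):
    """Collapse (steps, features) into (features,) with learned per-step weights.

    x: outputs of the previous layer, one row per time step
    w: scoring vector of shape (features,)
    b: per-step bias of shape (steps,)
    """
    scores = np.tanh(x @ w + b)                   # one scalar score per time step
    weights = np.exp(scores)
    weights = weights / weights.sum()             # softmax over time steps
    context = (x * weights[:, None]).sum(axis=0)  # weighted sum of the steps
    return context, weights

# Toy input: 5 time steps, 4 features each
rng = np.random.default_rng(0)
steps, features = 5, 4
x = rng.normal(size=(steps, features))
w = rng.normal(size=features)
b = np.zeros(steps)

context, weights = attention_pool(x, w, b)
print(weights.sum(), context.shape)
```

In the real layer `w` and `b` are trainable, so the network learns which time steps deserve the most weight.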

##### Keras Embedding

https://stats.stackexchange.com/questions/324992/how-the-embedding-layer-is-trained-in-keras-embedding-layer

A Keras Embedding layer is trained just like any other layer in the neural network (when we are not using an external pre-trained embedding matrix such as Word2Vec, GloVe, etc.). It participates with all the other layers in learning to optimize the end goal, i.e., to minimize the loss. I could not locate any official documentation on this.

It is completely different from Word2Vec or other pre-trained embedding networks. Word2Vec refers to a very specific network setup (a shallow 2-layer network along with a few other optimizations) which tries to learn an embedding that captures the semantics of words.
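
One way to see why the Embedding layer trains like any other layer: looking up a row of its weight matrix is equivalent to multiplying a one-hot vector by that matrix, i.e., a Dense layer without bias or activation. The NumPy sketch below only illustrates that equivalence; the sizes and token ids are arbitrary and it does not call Keras.

```python
import numpy as np

vocab_size, embed_dim = 6, 3
rng = np.random.default_rng(0)

# Stand-in for the trainable weight matrix of an Embedding layer
embedding_weights = rng.normal(size=(vocab_size, embed_dim))

token_ids = np.array([2, 0, 5])  # a toy tokenized question

# What an embedding lookup does: pick rows by index
looked_up = embedding_weights[token_ids]

# The same result as a matrix product with one-hot inputs
one_hot = np.eye(vocab_size)[token_ids]
via_matmul = one_hot @ embedding_weights

print(np.allclose(looked_up, via_matmul))
```

Because the lookup is just a linear map, gradients of the loss flow into the selected rows during backpropagation, which is how the vectors get learned.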

##### Ideas to try: -
* Topic modeling based insincerity classification, i.e., sentiment analysis
* Windowed or localized Attention
* CNN & 2D Max Pooling
* Ensemble Learning: Train a separate model with each of the pre-trained embeddings and then take a weighted average of their predictions as the final prediction
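
The ensemble idea in the last bullet could look roughly like the sketch below. It assumes each model has already produced sigmoid probabilities for the same test questions; the probabilities, the blend weights and the helper name `weighted_average` are hypothetical and would have to be tuned on validation data.

```python
import numpy as np

def weighted_average(predictions, weights):
    """Blend per-model probabilities of shape (n_models, n_samples)."""
    predictions = np.asarray(predictions, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalize so the blend stays in [0, 1]
    return weights @ predictions

# Hypothetical probabilities from three models, one per pre-trained embedding
preds = [
    [0.9, 0.2, 0.4],
    [0.7, 0.1, 0.6],
    [0.8, 0.3, 0.5],
]
blend = weighted_average(preds, weights=[0.5, 0.3, 0.2])
labels = (blend >= 0.5).astype(int)  # same 0.5 threshold as a single model
print(blend, labels)
```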

##### Things to explore: -
* Technically, Attention model should have some concept of 'window'! Yes, Stanford NLP confirms my gut feeling and in fact it helps in achieving better BLEU Score for seq-2-seq NMT! Refer to https://nlp.stanford.edu/pubs/emnlp15_attn.pdf . Need to update Attention to use local window.
* Should the Embedding Matrix be normalized for seq-2-seq attention learning based sentiment analysis? https://www.kaggle.com/c/quora-insincere-questions-classification/discussion/72893 
    * https://arxiv.org/pdf/1808.06305.pdf In the word embedding field, it is observed that learned word vectors usually share a large mean and several dominant principal components, which prevents word embedding from being isotropic. Word vectors that are isotropically distributed (or uniformly distributed in spatial angles) can be differentiated from each other more easily.
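
One post-processing step in the spirit of the last bullet is to subtract the shared mean and project out the few dominant principal components. The sketch below is an assumption about how that could be tried, not something this notebook applies; the function name and `n_components=2` are hypothetical choices, and the random matrix stands in for a real embedding matrix.

```python
import numpy as np

def remove_mean_and_top_components(embeddings, n_components=2):
    """Center the word vectors and remove their dominant principal directions."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions, strongest first
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:n_components]
    # Subtract the projection onto the dominant directions
    return centered - centered @ top.T @ top

# Toy "embedding matrix": 50 words, 8 dimensions, with a large shared mean
rng = np.random.default_rng(0)
demo = rng.normal(size=(50, 8)) + 3.0
processed = remove_mean_and_top_components(demo, n_components=2)
print(processed.shape)
```

Rows that are all zeros (words without a pre-trained vector) would need to be excluded before computing the mean, otherwise they stop being zero after centering.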

In [None]:
# Input to the Keras Embedding layer for learning word embeddings on the fly
# Unlike a pre-trained embedding, where the embedding dimension (word vector size) is fixed, here the user can choose the embedding dimension
EMBED_DIM = 300

MAX_FEATURES = 100000

# Just ~0.8% of questions are longer than 30 words
MAX_SEQ_LEN = 60
LSTM_UNITS = 64

VALID_TRAIN_RATIO = 0.1

BATCH_SIZE = 512
N_EPOCHS = 10
LEARNING_RATE = 0.0001

In [None]:
import os
import pandas as pd
import numpy as np

from nltk.corpus import stopwords
from gensim.models import KeyedVectors

from wordcloud import WordCloud

from sklearn.model_selection import train_test_split

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras import optimizers
from keras.models import Model
from keras.layers import Input, Dense, Embedding, Dropout, LSTM, CuDNNGRU, Bidirectional, GlobalMaxPool1D, MaxPool2D
from keras.engine.topology import Layer
from keras import initializers, regularizers, constraints, layers
from keras import backend as K
from keras.callbacks import ModelCheckpoint

from gensim.parsing.preprocessing import preprocess_string, strip_tags, strip_punctuation, strip_multiple_whitespaces, strip_numeric, remove_stopwords, strip_short
from tqdm import tqdm
tqdm.pandas()

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline

In [None]:
train_df = pd.read_csv("../input/train.csv")
test_df =  pd.read_csv("../input/test.csv")

In [None]:
(train_df.shape, test_df.shape)

In [None]:
def build_vocab(questions, verbose=True):
    vocab={}
    
    for question in tqdm(questions, disable=(not verbose)):
        for word in question:
            try:
                vocab[word] += 1
            except KeyError:
                vocab[word] = 1
    return vocab

In [None]:
questions = train_df["question_text"].progress_apply(lambda x: x.split()).values
vocab_raw = build_vocab(questions)

vocab_raw_size = len(vocab_raw) + 1

In [None]:
# 188878 as vocab_size after cleanup -vs- 508824 original vocab_raw_size

txt_filters = [lambda x: x.lower(), strip_tags, strip_punctuation, strip_multiple_whitespaces, strip_numeric, remove_stopwords, strip_short]
train_df["question_text"] = train_df["question_text"].progress_apply(lambda x: ' '.join(preprocess_string(x, txt_filters)))
test_df["question_text"] = test_df["question_text"].progress_apply(lambda x: ' '.join(preprocess_string(x, txt_filters)))

In [None]:
questions = train_df["question_text"].progress_apply(lambda x: x.split()).values
vocab = build_vocab(questions)

vocab_size = len(vocab) + 1
max_tokens = MAX_FEATURES if vocab_size >= MAX_FEATURES else vocab_size

print({k: vocab[k] for k in list(vocab)[:5]})

In [None]:
vocab_sorted = sorted(vocab.items(), key=lambda kv: kv[1], reverse=True)
#vocab_sorted = sorted(vocab.items(), key=lambda kv: kv[1])

In [None]:
# Words that occur at most 100 times in the cleaned corpus
vocab_part = dict((k, v) for k, v in vocab.items() if v <= 100)
len(vocab_part)

In [None]:
vocab_sorted[19990:20000]

In [None]:
x = {1: 2, 3: 4, 4: 3, 2: 1, 0: 0}
sorted_by_value = sorted(x.items(), key=lambda kv: kv[1])
sorted_by_value

In [None]:
iterator = iter(vocab_part.items())
for i in range(100):
    print(next(iterator))

In [None]:
(MAX_FEATURES, max_tokens, vocab_size, vocab_raw_size)

In [None]:
# Found embeddings for 30.05% of vocab (without text cleaning)
# Found embeddings for  87.66% of all text (without text cleaning)

# Found embeddings for 46.85% of vocab (after text cleaning)
# Found embeddings for  96.69% of all text (after text cleaning)

wiki_embed_path = '../input/embeddings/wiki-news-300d-1M/wiki-news-300d-1M.vec'
embeddings_dict_master = {}
f = open(wiki_embed_path, encoding='utf-8', errors='ignore')

In [None]:
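for line in tqdm(f):
    values = line.split()
    # Skip the header line ("<vocab size> <dim>") and any malformed row
    if len(values) != EMBED_DIM + 1:
        continue
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_dict_master[word] = coefs
f.close()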

In [None]:
import operator 

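def check_coverage(vocab, embeddings_dict_master):
    """Report how much of the vocab, and of the running text, has a pre-trained vector."""
    covered = {}
    oov = {}
    n_covered = 0  # word occurrences that have an embedding
    n_oov = 0      # word occurrences that do not
    for word in tqdm(vocab):
        try:
            covered[word] = embeddings_dict_master[word]
            n_covered += vocab[word]
        except KeyError:
            oov[word] = vocab[word]
            n_oov += vocab[word]

    print('Found embeddings for {:.2%} of vocab'.format(len(covered) / len(vocab)))
    print('Found embeddings for  {:.2%} of all text'.format(n_covered / (n_covered + n_oov)))
    # Out-of-vocabulary words, most frequent first
    sorted_x = sorted(oov.items(), key=operator.itemgetter(1))[::-1]

    return sorted_x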

In [None]:
oov = check_coverage(vocab, embeddings_dict_master)
oov[:15]

In [None]:
# Sincere questions (target = 0) are ~93.8%; insincere questions (target = 1) are only ~6.2%

sns.countplot(x='target', data=train_df)

In [None]:
train_X = train_df["question_text"].values
#val_X = val_df["question_text"].values
test_X = test_df["question_text"].values

train_y = train_df['target'].values
#val_y = val_df['target'].values

In [None]:
class Attention(Layer):
    def __init__(self, step_dim,
                 W_regularizer=None, b_regularizer=None,
                 W_constraint=None, b_constraint=None,
                 bias=True, **kwargs):
        self.supports_masking = True
        self.init = initializers.get('glorot_uniform')

        self.W_regularizer = regularizers.get(W_regularizer)
        self.b_regularizer = regularizers.get(b_regularizer)

        self.W_constraint = constraints.get(W_constraint)
        self.b_constraint = constraints.get(b_constraint)

        self.bias = bias
        self.step_dim = step_dim
        self.features_dim = 0
        super(Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        assert len(input_shape) == 3

        self.W = self.add_weight(shape=(input_shape[-1],),
                                 initializer=self.init,
                                 name='{}_W'.format(self.name),
                                 regularizer=self.W_regularizer,
                                 constraint=self.W_constraint)
        self.features_dim = input_shape[-1]

        if self.bias:
            self.b = self.add_weight(shape=(input_shape[1],),
                                     initializer='zeros',
                                     name='{}_b'.format(self.name),
                                     regularizer=self.b_regularizer,
                                     constraint=self.b_constraint)
        else:
            self.b = None

        self.built = True

    def compute_mask(self, input, input_mask=None):
        return None

    def call(self, x, mask=None):
        features_dim = self.features_dim
        step_dim = self.step_dim

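        # Score every time step: flatten to (batch*steps, features), project onto W,
        # then restore the (batch, steps) shape
        eij = K.reshape(K.dot(K.reshape(x, (-1, features_dim)),
                        K.reshape(self.W, (features_dim, 1))), (-1, step_dim))

        if self.bias:
            eij += self.b

        eij = K.tanh(eij)

        # Softmax over time steps, computed by hand so masked steps can be zeroed out
        a = K.exp(eij)

        if mask is not None:
            a *= K.cast(mask, K.floatx())

        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())

        # Weighted sum of the inputs over the time axis -> (batch, features)
        a = K.expand_dims(a)
        weighted_input = x * a
        return K.sum(weighted_input, axis=1)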

    def compute_output_shape(self, input_shape):
        return input_shape[0],  self.features_dim

In [None]:
# Fit the tokenizer on all questions from train & test

tokenizer = Tokenizer(num_words=max_tokens)
#tokenizer.fit_on_texts(list(train_X) + list(val_X) + list(test_X))
tokenizer.fit_on_texts(list(train_X) + list(test_X))

In [None]:
# preparing the FastText word-embeddings matrix
embedding_matrix = np.zeros((max_tokens, EMBED_DIM))
word_index = tokenizer.word_index
for word, i in word_index.items():
    if i < max_tokens:
        embedding_vector = embeddings_dict_master.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector # words not found will be all zeroes

In [None]:
train_X = tokenizer.texts_to_sequences(train_X)
#val_X = tokenizer.texts_to_sequences(val_X)
test_X = tokenizer.texts_to_sequences(test_X)

In [None]:
train_X[:5]

In [None]:
mylen = np.vectorize(len)
len_train = mylen(train_X)
#len_val = mylen(val_X)
len_test = mylen(test_X)

In [None]:
# NOTE that Seaborn distplot does not support log scale :(

sns.distplot( mylen(train_X) , kde=False, color="skyblue", label="train_X")
#sns.distplot( mylen(val_X) , kde=False, color="red", label="val_X")
sns.distplot( mylen(test_X) , kde=False, color="green", label="test_X")

plt.legend()

In [None]:
#unique_elements, counts_elements = np.unique(len_train, return_counts=True)
#print(np.asarray((unique_elements, counts_elements)))

# Keep MAX_SEQ_LEN at 30 as 30+ test questions are just 0.03%
#(sum(len_train > 30)*100/train_df.shape[0], sum(len_val > 30)*100/train_df.shape[0], sum(len_test > 30)*100/test_df.shape[0])
(sum(len_train > 30)*100/train_df.shape[0], sum(len_test > 30)*100/test_df.shape[0])

In [None]:
train_X = pad_sequences(train_X, maxlen=MAX_SEQ_LEN)
#val_X = pad_sequences(val_X, maxlen=MAX_SEQ_LEN)
test_X = pad_sequences(test_X, maxlen=MAX_SEQ_LEN)

In [None]:
# CuDNNGRU is ~6-7 times faster than GRU/LSTM on GPU

def model_BidiGruLstm_Attention(embedding_matrix):
    inp = Input(shape=(MAX_SEQ_LEN,))
    x = Embedding(max_tokens, EMBED_DIM, weights=[embedding_matrix])(inp)
    x = Bidirectional(CuDNNGRU(LSTM_UNITS, return_sequences=True))(x)
    
    x = Attention(MAX_SEQ_LEN)(x)
    
    x = Dense(16, activation="relu")(x)
    x = Dropout(0.1)(x)
    x = Dense(1, activation="sigmoid")(x)
    model = Model(inputs=inp, outputs=x)
    
    # Other Adam settings stay at their defaults: beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=False
    adam = optimizers.Adam(lr=LEARNING_RATE)
    
    model.compile(loss='binary_crossentropy', optimizer=adam, metrics=['accuracy'])
    
    return model    

In [None]:
from pathlib import Path

tmp_path = Path("../tmp")

if not tmp_path.is_dir():
    os.mkdir(tmp_path)
    print("tmp folder was not available, hence created")

In [None]:
filepath = "../tmp/weights.best.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
callbacks_list = [checkpoint]

In [None]:
model = model_BidiGruLstm_Attention(embedding_matrix)
print(model.summary())

In [None]:
model_fitting_history = model.fit(train_X, train_y, validation_split=VALID_TRAIN_RATIO, epochs=N_EPOCHS, batch_size=BATCH_SIZE, callbacks=callbacks_list, verbose=0)

In [None]:
# Top panel: accuracy per epoch; bottom panel: loss per epoch (train vs. validation)
epochs = np.arange(1, N_EPOCHS+1)

plt.figure(1)
plt.subplot(211)
plt.plot(epochs, model_fitting_history.history['acc'])
plt.plot(epochs, model_fitting_history.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validate'], loc='upper left')

plt.subplot(212)
plt.plot(epochs, model_fitting_history.history['loss'])
plt.plot(epochs, model_fitting_history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validate'], loc='upper left')

plt.tight_layout()
plt.show()

In [None]:
# Recreate the original network model without above learned weights, so that best learned weights can be loaded into it for Inference, i.e., Prediction

model_final = model_BidiGruLstm_Attention(embedding_matrix)
model_final.load_weights("../tmp/weights.best.hdf5")

In [None]:
pred_test_y = model_final.predict([test_X], batch_size=BATCH_SIZE, verbose=1)

In [None]:
# Threshold the sigmoid output at 0.5 and flatten (n, 1) -> (n,) for the DataFrame column
pred_test_y = np.round(pred_test_y).astype(int).ravel()
out_df = pd.DataFrame({"qid":test_df["qid"].values})
out_df['prediction'] = pred_test_y
out_df.to_csv("submission.csv", index=False)