## Quora Insincere Questions Classification - BiLSTM/GRU with Attention -or- CNN with 2D MaxPooling & using All Embeddings

#### Sunil Kumar
        
##### Solution workflows: - 

* Prepare vocabulary & check it against the pre-trained word embeddings vocab for coverage
* Inspect the question-length histogram of the test dataset to identify an appropriate sequence length (use a similar count for LSTM units)
* Fix max_features as per cleaned corpus vocab
* Prepare embedding matrix for our vocab (using 3 of the given pretrained embeddings)
* Prepare input word vectors
* Define Bidirectional LSTM/GRU with Attention network (using different path per embedding & then concatenate 3 streams before Dense narrowing down) -or- CNN with 2D Max Pool
* Train & validate on partitions of the training data, using a Keras ModelCheckpoint callback that saves the model weights corresponding to the best val_accuracy/f1
    * NOTE: the classes are unbalanced, so F1 Score is the basis of evaluation (it is the evaluation rule of this competition)
    * Keras no longer provides a built-in F1 metric, hence a custom F1 loss and F1 metric are fed to the training epochs as well as to the checkpointing callback
* Re-create the same untrained model, load the saved best weights, and predict labels for the test data
* Prepare the submission csv

##### Attention Layer in the NLP Neural Network

This solution does use seq-2-seq processing, but only in an intermediate layer of the network. Ultimately, the network is compressed through a pooling layer, a narrowing Dense layer, etc. for the final goal of binary classification. Note that the full encoder-decoder pattern is not suited to this problem. The attention used here is called Additive Attention; refer to http://ruder.io/deep-learning-nlp-best-practices/.

Attention is most often used in sequence-to-sequence models. Without an attention mechanism, the model has to capture the essence of the entire input sequence in a single hidden state. The attention mechanism simply gives the network access to its internal memory (in this case, the previous layer's output): the network retrieves a weighted combination of all memory locations, and it learns these attention weights too.
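
To make the "weighted combination" concrete, here is a minimal NumPy sketch of the pooling that the `Attention` layer defined later in this notebook performs on one batch. The helper name `attention_pool`, the random weights and the toy shapes are assumptions for illustration; in the real layer `w` and `b` are learned.

```python
import numpy as np

def attention_pool(x, w, b, eps=1e-7):
    """Collapse a (batch, steps, features) array to (batch, features).

    Hypothetical helper: one score per time step, normalised over the
    steps, then used to take a weighted sum of the step vectors.
    """
    scores = np.tanh(x @ w + b)                      # (batch, steps)
    weights = np.exp(scores)
    weights = weights / (weights.sum(axis=1, keepdims=True) + eps)
    pooled = (x * weights[..., None]).sum(axis=1)    # (batch, features)
    return pooled, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 30, 64))   # e.g. 30 time steps, 2 * 32 recurrent units
w = rng.normal(size=64)            # random stand-in for the learned weights
b = np.zeros(30)                   # one bias per time step
pooled, weights = attention_pool(x, w, b)
print(pooled.shape, weights.sum(axis=1))
```

Each row of `weights` sums to (almost exactly) one, so the output is a convex combination of the per-token vectors rather than only the last hidden state.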

Decision analysis: -
* How to make use of given multiple pre-trained embeddings?
    * Use them to generate separate predictions from some defined n-net. Ideally, same n-net with these different embedding would fare differently which should help determine weight in averaging separate predictions.
    * Prepare a part-parallel sub-network to accept multiple inputs (one per embedding) and later merge/concatenate towards binary classifier last layer. NOTE that its feasibility has been checked under memory & runtime limitations.
    * It appears counter intuitive to average separate word embeddings as no correspondence exists between the dimensions (even when all are of same size 300) of separately trained word embedding sets. But this reference http://aclweb.org/anthology/N18-2031 provides support for averaging!
* Should the prediction be derived by thresholding the probability at 0.5, or is a different threshold required?
    * Since the optimal values of the other hyperparameters (max_features, max_seq_len, lstm_units, batch_size, learning_rate) are not known, this threshold is yet another hyperparameter. Once the others are roughly known, it can be searched on its own.
    * The F1 Score does depend on the conditional probability: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4442797/.
* Why & how to normalize the Embedding matrix? Should the Embedding matrix be normalized for seq-2-seq attention learning based sentiment analysis? https://www.kaggle.com/c/quora-insincere-questions-classification/discussion/72893
    * https://arxiv.org/pdf/1808.06305.pdf In the word embedding field, it is observed that learned word vectors usually share a large mean and several dominant principal components, which prevents word embedding from being isotropic. Word vectors that are isotropically distributed (or uniformly distributed in spatial angles) can be differentiated from each other more easily.
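
The averaging option discussed above could look roughly like this: standardise each embedding matrix per dimension, then take the element-wise mean so that a single 300-d lookup table results. This is only a sketch with random stand-in matrices; `standardise` and `average_embeddings` are hypothetical helpers, and the notebook itself keeps three separate Embedding paths instead.

```python
import numpy as np

def standardise(matrix, eps=1e-8):
    # Zero mean and unit variance per embedding dimension (column).
    return (matrix - matrix.mean(axis=0)) / (matrix.std(axis=0) + eps)

def average_embeddings(matrices):
    # All matrices must share the same (vocab_size, dim) shape and row order.
    stacked = np.stack([standardise(m) for m in matrices])
    return stacked.mean(axis=0)

rng = np.random.default_rng(1)
# Random stand-ins for the GloVe, Paragram and WikiNews matrices.
fake_glove = rng.normal(loc=0.5, size=(1000, 300))
fake_para = rng.normal(loc=-0.2, size=(1000, 300))
fake_wiki = rng.normal(loc=0.0, size=(1000, 300))

meta = average_embeddings([fake_glove, fake_para, fake_wiki])
print(meta.shape)
```

Standardising first keeps one embedding set from dominating the mean merely because its values are on a larger scale.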

##### Ideas to try: -
* Address classes imbalance
    * Over-sampling of minority class using SMOTE (tried but it gave discouraging learning & prediction results)
    * Specify class weights as an input to Keras model learning
* Topic modeling based insincerity classification, i.e., sentiment analysis
* Windowed or localized Attention
    * Technically, Attention model should have some concept of 'window'! Yes, Stanford NLP confirms my gut feeling and in fact it helps in achieving better BLEU Score for seq-2-seq NMT! Refer to https://nlp.stanford.edu/pubs/emnlp15_attn.pdf .
* Ensemble Learning: Learn models separately with each of the pre-trained Embeddings and then take a weighted average of those predictions for the final prediction
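
For the class-weight idea, a minimal sketch of inverse-frequency weights follows. The resulting dictionary has the shape Keras accepts for the `class_weight` argument of `model.fit`; the helper name `balanced_class_weights` and the toy 94:6 label split are assumptions for illustration only.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

toy_labels = [0] * 94 + [1] * 6      # toy imbalance, not the real data
class_weight = balanced_class_weights(toy_labels)
print(class_weight)
# Would be passed as: model.fit(..., class_weight=class_weight)
```

The rare class receives a proportionally larger weight, so its errors contribute more to the loss.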

In [None]:
MAX_FEATURES = 50000

# All given pre-trained embeddings have word vec size 300
EMBED_DIM = 300

# Only ~0.8% of questions are longer than 30 words
MAX_SEQ_LEN = 30
LSTM_UNITS = 32

VALID_TRAIN_RATIO = 0.2

BATCH_SIZE = 512
N_EPOCHS = 5
LEARNING_RATE = 0.001

F1_threshold = 0.36

CNN_FILTER_SIZES = [3,1,3]
N_CNN_FILTERS = 32

In [None]:
# Reproducible randomness is of great importance when optimizing/tuning loss/accuracy goals across experiments/runs

from numpy.random import seed
seed(123)

from tensorflow import set_random_seed
set_random_seed(456)

In [None]:
import os
import pandas as pd
import numpy as np
import operator 
import re

from nltk.corpus import stopwords
from gensim.models import KeyedVectors
from wordcloud import WordCloud
from sklearn.model_selection import train_test_split
from sklearn import metrics

import tensorflow as tf
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras import optimizers
from keras.models import Model
from keras.layers import Input, Dense, Embedding, Dropout, LSTM, CuDNNGRU, Bidirectional, GlobalMaxPool1D, concatenate
from keras.layers import Reshape, Flatten, Conv2D, MaxPool2D
from keras.engine.topology import Layer
from keras import initializers, regularizers, constraints, optimizers, layers
from keras import backend as K
from keras.callbacks import ModelCheckpoint


from gensim.parsing.preprocessing import preprocess_string, strip_tags, strip_punctuation, strip_multiple_whitespaces, strip_numeric, remove_stopwords, strip_short
from tqdm import tqdm
tqdm.pandas()

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline

In [None]:
train_df = pd.read_csv("../input/train.csv")
test_df =  pd.read_csv("../input/test.csv")

In [None]:
(train_df.shape, test_df.shape)

In [None]:
def build_vocab(questions, verbose=True):
    vocab={}
    
    for question in tqdm(questions):
        for word in question:
            try:
                vocab[word] += 1
            except KeyError:
                vocab[word] = 1
    return vocab

In [None]:
# Raw vocab

qs_train = train_df["question_text"].apply(lambda x: x.split()).values
qs_test = test_df["question_text"].apply(lambda x: x.split()).values
vocab_raw = build_vocab(list(qs_train) + list(qs_test))

vocab_raw_size = len(vocab_raw) + 1

In [None]:
# https://www.kaggle.com/theoviel/improve-your-score-with-some-text-preprocessing/notebook
contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not", "didn't": "did not",  "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not", "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",  "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would", "i'd've": "i would have", "i'll": "i will",  "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would", "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam", "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have", "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock", "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have", "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is", "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as", "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would", "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have", "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have", "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are", 
"we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",  "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is", "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have", "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have", "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all", "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have","you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have", "you're": "you are", "you've": "you have" }
mispell_dict = {'colour': 'color', 'centre': 'center', 'favourite': 'favorite', 'travelling': 'traveling', 'counselling': 'counseling', 'theatre': 'theater', 'cancelled': 'canceled', 'labour': 'labor', 'organisation': 'organization', 'wwii': 'world war 2', 'citicise': 'criticize', 'youtu ': 'youtube ', 'Qoura': 'Quora', 'sallary': 'salary', 'Whta': 'What', 'narcisist': 'narcissist', 'howdo': 'how do', 'whatare': 'what are', 'howcan': 'how can', 'howmuch': 'how much', 'howmany': 'how many', 'whydo': 'why do', 'doI': 'do I', 'theBest': 'the best', 'howdoes': 'how does', 'mastrubation': 'masturbation', 'mastrubate': 'masturbate', "mastrubating": 'masturbating', 'pennis': 'penis', 'Etherium': 'Ethereum', 'narcissit': 'narcissist', 'bigdata': 'big data', '2k17': '2017', '2k18': '2018', 'qouta': 'quota', 'exboyfriend': 'ex boyfriend', 'airhostess': 'air hostess', "whst": 'what', 'watsapp': 'whatsapp', 'demonitisation': 'demonetization', 'demonitization': 'demonetization', 'demonetisation': 'demonetization'}
punct_mapping = {"‘": "'", "₹": "e", "´": "'", "°": "", "€": "e", "™": "tm", "√": " sqrt ", "×": "x", "²": "2", "—": "-", "–": "-", "’": "'", "_": "-", "`": "'", '“': '"', '”': '"', '“': '"', "£": "e", '∞': 'infinity', 'θ': 'theta', '÷': '/', 'α': 'alpha', '•': '.', 'à': 'a', '−': '-', 'β': 'beta', '∅': '', '³': '3', 'π': 'pi', }    

def clean_text(x):
    for dic in [contraction_mapping, mispell_dict, punct_mapping]:
        for word in dic.keys():
            x = x.replace(word, dic[word])
    return x

train_df['question_text'] = train_df['question_text'].progress_apply(lambda x: clean_text(x))
test_df['question_text'] = test_df['question_text'].progress_apply(lambda x: clean_text(x))


In [None]:
# Digits are stripped (strip_numeric) instead of being kept as ### tokens: although ### is present in the
# pretrained embeddings, numbers may not be useful for telling insincere questions apart
# Gensim preprocess_string applies the filters below in order

txt_filters = [lambda x: x.lower(), strip_tags, strip_punctuation, strip_multiple_whitespaces, strip_numeric, remove_stopwords, strip_short]
qs_train = train_df["question_text"].apply(lambda x: preprocess_string(x, txt_filters))
qs_test = test_df["question_text"].apply(lambda x: preprocess_string(x, txt_filters))

train_df["question_text"] = qs_train.apply(lambda x: " ".join(x))
test_df["question_text"] = qs_test.apply(lambda x: " ".join(x))

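# Hold-out frame for the prediction probability threshold search.
# NOTE: val_df is a subset of train_df, and train_df is later passed in full to model.fit,
# so these rows are not fully unseen by the model.
_, val_df = train_test_split(train_df, test_size=VALID_TRAIN_RATIO, random_state=12345)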

# +1 for missing token, just in case this remains smaller than MAX_FEATURES
vocab = build_vocab(list(qs_train) + list(qs_test))
vocab_size = len(vocab) + 1

In [None]:
train_insincere_qs = train_df[train_df.target == 1].question_text
train_sincere_qs = train_df[train_df.target == 0].question_text
test_qs = test_df.question_text

qs_train_insincere = train_insincere_qs.apply(lambda x: x.split()).values
qs_train_sincere = train_sincere_qs.apply(lambda x: x.split()).values
qs_test = test_qs.apply(lambda x: x.split()).values

vocab_train_insincere = build_vocab(list(qs_train_insincere))
vocab_train_sincere = build_vocab(list(qs_train_sincere))
vocab_test = build_vocab(list(qs_test))

(len(vocab_train_sincere), len(vocab_train_insincere), len(vocab_test))

In [None]:
# It seems that the huge amount of remaining keys in Sincere vocab are all mis-spelt & non-english words :)
#diff = set(vocab_train_sincere.keys()) - set(vocab_test.keys())

# Insincere train & test questions should be just enough for use as Vocab
good_vocab_keys = set(vocab_train_insincere.keys()).union(set(vocab_test.keys()))
good_vocab_size = len(good_vocab_keys)

max_tokens = MAX_FEATURES if vocab_size >= MAX_FEATURES else vocab_size
(MAX_FEATURES, max_tokens, vocab_size, vocab_raw_size)

In [None]:
# Sincere questions (target == 0) are ~93.8%; insincere questions are the minority class

sns.countplot(x='target', data=train_df)

In [None]:
def check_coverage(vocab, embeddings_ref):
    a = {}
    oov = {}
    k = 0
    i = 0
    for word in tqdm(vocab):
        try:
            a[word] = embeddings_ref[word]
            k += vocab[word]
        except KeyError:
            # Word has no pre-trained vector
            oov[word] = vocab[word]
            i += vocab[word]

    print('Found embeddings for {:.2%} of vocab'.format(len(a) / len(vocab)))
    print('Found embeddings for  {:.2%} of all text'.format(k / (k + i)))
    sorted_x = sorted(oov.items(), key=operator.itemgetter(1))[::-1]

    return sorted_x

# word_index = tokenizer.word_index
# NOTE that due to known issue in Keras Tokenizer, num_words & len(word_index) are not same
def load_embeddings(word_index, embed_file, max_features):

    embeddings_ref = {}
    
    _, file_extension = os.path.splitext(embed_file)
    if file_extension == '.bin':
        embeddings_ref = KeyedVectors.load_word2vec_format(embed_file, binary=True)
    else:
        def get_coefs(word,*arr): return word, np.asarray(arr, dtype='float32')
        embeddings_ref = dict(get_coefs(*o.split(" ")) for o in open(embed_file, encoding="utf8", errors='ignore'))
        
    embedding_matrix = np.zeros((max_features, EMBED_DIM))
    
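    for word, i in word_index.items():
        # Skip words ranked beyond the vocabulary limit
        if i >= max_features:
            continue

        if file_extension == '.bin':
            # KeyedVectors raises KeyError for unknown words, so test membership first
            embedding_vector = embeddings_ref[word] if word in embeddings_ref else None
        else:
            embedding_vector = embeddings_ref.get(word)

        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

    # word_index maps words to ranks rather than counts, so only the
    # "% of vocab" figure printed by check_coverage is meaningful here
    check_coverage(word_index, embeddings_ref)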
            
    return embedding_matrix

class Attention(Layer):
    def __init__(self, step_dim,
                 W_regularizer=None, b_regularizer=None,
                 W_constraint=None, b_constraint=None,
                 bias=True, **kwargs):
        self.supports_masking = True
        self.init = initializers.get('glorot_uniform')

        self.W_regularizer = regularizers.get(W_regularizer)
        self.b_regularizer = regularizers.get(b_regularizer)

        self.W_constraint = constraints.get(W_constraint)
        self.b_constraint = constraints.get(b_constraint)

        self.bias = bias
        self.step_dim = step_dim
        self.features_dim = 0
        super(Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        assert len(input_shape) == 3

        self.W = self.add_weight((input_shape[-1],),
                                 initializer=self.init,
                                 name='{}_W'.format(self.name),
                                 regularizer=self.W_regularizer,
                                 constraint=self.W_constraint)
        self.features_dim = input_shape[-1]

        if self.bias:
            self.b = self.add_weight((input_shape[1],),
                                     initializer='zero',
                                     name='{}_b'.format(self.name),
                                     regularizer=self.b_regularizer,
                                     constraint=self.b_constraint)
        else:
            self.b = None

        self.built = True

    def compute_mask(self, input, input_mask=None):
        return None

    def call(self, x, mask=None):
        features_dim = self.features_dim
        step_dim = self.step_dim

        eij = K.reshape(K.dot(K.reshape(x, (-1, features_dim)),
                        K.reshape(self.W, (features_dim, 1))), (-1, step_dim))

        if self.bias:
            eij += self.b

        eij = K.tanh(eij)

        a = K.exp(eij)

        if mask is not None:
            a *= K.cast(mask, K.floatx())

        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())

        a = K.expand_dims(a)
        weighted_input = x * a
        return K.sum(weighted_input, axis=1)

    def compute_output_shape(self, input_shape):
        return input_shape[0],  self.features_dim

In [None]:
train_X = train_df["question_text"].values
val_X = val_df["question_text"].values
test_X = test_df["question_text"].values

val_y = val_df["target"].values
train_y = train_df["target"].values

In [None]:
# Feed all questions from train, val & test

tokenizer = Tokenizer(num_words=max_tokens)
tokenizer.fit_on_texts(list(train_X) + list(test_X))
#tokenizer.fit_on_texts(vocab_train_insincere.keys(), vocab_test.keys())
#tokenizer.fit_on_texts(list(train_X) + list(val_X) + list(test_X))

In [None]:
#tokenizer.fit_on_texts(list(vocab_train_insincere.keys()) + list(vocab_test.keys()))
word_index = tokenizer.word_index

In [None]:
# All 4 embeddings have word vec size as 300

# TODO: Fix its parsing
#embeddings_path = "../input/embeddings/GoogleNews-vectors-negative300/GoogleNews-vectors-negative300.bin"
#embeddings_gnews = load_embeddings(word_index, embeddings_path, max_tokens)

embeddings_path = "../input/embeddings/glove.840B.300d/glove.840B.300d.txt"
embeddings_glove = load_embeddings(word_index, embeddings_path, max_tokens)
emb_mean = np.mean(embeddings_glove,axis = 0)
emb_std = np.std(embeddings_glove, axis = 0)
embeddings_glove = (embeddings_glove - emb_mean) / emb_std
print("Done preparing embeddings matrix from pre-trained GloVe!")

embeddings_path = "../input/embeddings/paragram_300_sl999/paragram_300_sl999.txt"
embeddings_para = load_embeddings(word_index, embeddings_path, max_tokens)
emb_mean = np.mean(embeddings_para,axis = 0)
emb_std = np.std(embeddings_para, axis = 0)
embeddings_para = (embeddings_para - emb_mean) / emb_std
print("Done preparing embeddings matrix from pre-trained Paragram!")

embeddings_path = "../input/embeddings/wiki-news-300d-1M/wiki-news-300d-1M.vec"
embeddings_wiki = load_embeddings(word_index, embeddings_path, max_tokens)
emb_mean = np.mean(embeddings_wiki,axis = 0)
emb_std = np.std(embeddings_wiki, axis = 0)
embeddings_wiki = (embeddings_wiki - emb_mean) / emb_std
print("Done preparing embeddings matrix from pre-trained WikiNews!")

In [None]:
train_X = tokenizer.texts_to_sequences(train_X)
val_X = tokenizer.texts_to_sequences(val_X)
test_X = tokenizer.texts_to_sequences(test_X)

In [None]:
mylen = np.vectorize(len)

len_train = mylen(train_X)
len_test = mylen(test_X)

# Keep MAX_SEQ_LEN at 30 as 30+ test questions are just 0.03%
(sum(len_train > 30)*100/train_df.shape[0], sum(len_test > 30)*100/test_df.shape[0])

In [None]:
# NOTE that Seaborn distplot does not support log scale :(

sns.distplot( mylen(train_X) , kde=False, color="skyblue", label="train_X")
sns.distplot( mylen(test_X) , kde=False, color="green", label="test_X")

plt.legend()

In [None]:
train_X = pad_sequences(train_X, maxlen=MAX_SEQ_LEN)
val_X = pad_sequences(val_X, maxlen=MAX_SEQ_LEN)
test_X = pad_sequences(test_X, maxlen=MAX_SEQ_LEN)
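
With its default arguments, `pad_sequences` pads and truncates at the front of each sequence. The plain-Python sketch below illustrates that behaviour for a single sequence; `pad_pre` is a hypothetical stand-in for illustration, not the Keras implementation.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad or left-truncate a list of token ids to exactly maxlen (maxlen >= 1)."""
    trimmed = sequence[-maxlen:]                       # keep the last maxlen tokens
    return [value] * (maxlen - len(trimmed)) + trimmed

short_question = pad_pre([5, 8, 2], maxlen=5)
long_question = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short_question, long_question)
```

Questions longer than `MAX_SEQ_LEN` therefore lose their first tokens, not their last ones.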

In [None]:
# Result was highly discouraging
'''
from imblearn.over_sampling import SMOTE
 
smote = SMOTE(kind = "regular")
train_X, train_y = smote.fit_sample(train_X, train_y)
'''

In [None]:
#def model_CNN_MaxPool2D(embed_wiki):    
def model_CNN_MaxPool2D(embed_glove, embed_para, embed_wiki):  
    inp = Input(shape=(MAX_SEQ_LEN, ))
    
    x_glove = Embedding(max_tokens, EMBED_DIM, weights=[embed_glove])(inp)
    x_para = Embedding(max_tokens, EMBED_DIM, weights=[embed_para])(inp)
    x_wiki = Embedding(max_tokens, EMBED_DIM, weights=[embed_wiki])(inp)
    
    # Seq of word vec presented as 1-channel tensor volume (like gray in image CNN) MAX_SEQ_LEN rows x EMBED_DIM columns (each row here represent a token of input Seq)
    x_glove = Reshape((MAX_SEQ_LEN, EMBED_DIM, 1))(x_glove)
    x_para = Reshape((MAX_SEQ_LEN, EMBED_DIM, 1))(x_para)
    x_wiki = Reshape((MAX_SEQ_LEN, EMBED_DIM, 1))(x_wiki)
        
    conv_glove = Conv2D(N_CNN_FILTERS, kernel_size=(CNN_FILTER_SIZES[0], EMBED_DIM), kernel_initializer='he_normal', activation='tanh')(x_glove)
    conv_para = Conv2D(N_CNN_FILTERS, kernel_size=(CNN_FILTER_SIZES[1], EMBED_DIM), kernel_initializer='he_normal', activation='tanh')(x_para)
    conv_wiki = Conv2D(N_CNN_FILTERS, kernel_size=(CNN_FILTER_SIZES[2], EMBED_DIM), kernel_initializer='he_normal', activation='tanh')(x_wiki)
    
    maxpool_glove = MaxPool2D(pool_size=(MAX_SEQ_LEN - CNN_FILTER_SIZES[0] + 1, 1))(conv_glove)
    maxpool_para = MaxPool2D(pool_size=(MAX_SEQ_LEN - CNN_FILTER_SIZES[1] + 1, 1))(conv_para)
    maxpool_wiki = MaxPool2D(pool_size=(MAX_SEQ_LEN - CNN_FILTER_SIZES[2] + 1, 1))(conv_wiki)
        
    z = concatenate([maxpool_glove, maxpool_para, maxpool_wiki], axis=1)
    z = Flatten()(z)
    z = Dropout(0.1)(z)
        
    outp = Dense(1, activation="sigmoid")(z)
    
    model = Model(inputs=inp, outputs=outp)

    # Other Adam settings keep their defaults: beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=False
    adam = optimizers.Adam(lr=LEARNING_RATE)
    
    model.compile(optimizer=adam, loss=f1_loss, metrics=['accuracy', f1])
    #model.compile(optimizer=adam, loss='binary_crossentropy', metrics=['accuracy'])
    
    return model

# CuDNNGRU is ~6-7 times faster than GRU/LSTM on GPU
# All three embedding matrices are required because the body builds one path per embedding
def model_BidiGruLstm_Attention(embed_glove, embed_para, embed_wiki):
    inp = Input(shape=(MAX_SEQ_LEN,))
    
    #x_gnews = Embedding(max_tokens, EMBED_DIM, weights=[embed_gnews])(inp)
    x_glove = Embedding(max_tokens, EMBED_DIM, weights=[embed_glove])(inp)
    x_para = Embedding(max_tokens, EMBED_DIM, weights=[embed_para])(inp)
    x_wiki = Embedding(max_tokens, EMBED_DIM, weights=[embed_wiki])(inp)
    
    #x_gnews = Bidirectional(CuDNNGRU(LSTM_UNITS, return_sequences=True))(x_gnews)
    x_glove = Bidirectional(CuDNNGRU(LSTM_UNITS, return_sequences=True))(x_glove)
    x_para = Bidirectional(CuDNNGRU(LSTM_UNITS, return_sequences=True))(x_para)
    x_wiki = Bidirectional(CuDNNGRU(LSTM_UNITS, return_sequences=True))(x_wiki)
    
    #x_gnews = Attention(MAX_SEQ_LEN)(x_gnews)
    x_glove = Attention(MAX_SEQ_LEN)(x_glove)
    x_para = Attention(MAX_SEQ_LEN)(x_para)
    x_wiki = Attention(MAX_SEQ_LEN)(x_wiki)
    
    x = concatenate([x_glove, x_para, x_wiki])
    
    x = Dense(16, activation="relu")(x)
    x = Dropout(0.1)(x)
    
    x = Dense(1, activation="sigmoid")(x)
    model = Model(inputs=inp, outputs=x)
    
    # Other Adam settings keep their defaults: beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=False
    adam = optimizers.Adam(lr=LEARNING_RATE)
    
    model.compile(optimizer=adam, loss=f1_loss, metrics=['accuracy', f1])
    #model.compile(optimizer=adam, loss='binary_crossentropy', metrics=['accuracy'])
    
    return model  

def f1(y_true, y_pred):
    
    # Adaptation of the "round()" used before to get the predictions. Clipping to make sure that the predicted raw values are between 0 and 1.
    y_pred = K.cast(K.greater(K.clip(y_pred, 0, 1), F1_threshold), K.floatx())
    #y_pred = K.round(y_pred)
    
    tp = K.sum(K.cast(y_true*y_pred, 'float'), axis=0)
    tn = K.sum(K.cast((1-y_true)*(1-y_pred), 'float'), axis=0)
    fp = K.sum(K.cast((1-y_true)*y_pred, 'float'), axis=0)
    fn = K.sum(K.cast(y_true*(1-y_pred), 'float'), axis=0)

    p = tp / (tp + fp + K.epsilon())
    r = tp / (tp + fn + K.epsilon())

    f1 = 2*p*r / (p+r+K.epsilon())
    f1 = tf.where(tf.is_nan(f1), tf.zeros_like(f1), f1)
    return K.mean(f1)

def f1_loss(y_true, y_pred):
    
    tp = K.sum(K.cast(y_true*y_pred, 'float'), axis=0)
    tn = K.sum(K.cast((1-y_true)*(1-y_pred), 'float'), axis=0)
    fp = K.sum(K.cast((1-y_true)*y_pred, 'float'), axis=0)
    fn = K.sum(K.cast(y_true*(1-y_pred), 'float'), axis=0)

    p = tp / (tp + fp + K.epsilon())
    r = tp / (tp + fn + K.epsilon())

    f1 = 2*p*r / (p+r+K.epsilon())
    f1 = tf.where(tf.is_nan(f1), tf.zeros_like(f1), f1)
    return 1 - K.mean(f1)
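
The difference between the two functions above is that `f1` thresholds the predictions while `f1_loss` works on the raw probabilities, which keeps it differentiable. The pure-Python sketch below illustrates that "soft" F1 on toy data; `soft_f1` and the example values are assumptions for illustration.

```python
def soft_f1(y_true, y_prob, eps=1e-7):
    """F1 computed from probabilities instead of hard 0/1 predictions."""
    tp = sum(t * p for t, p in zip(y_true, y_prob))
    fp = sum((1 - t) * p for t, p in zip(y_true, y_prob))
    fn = sum(t * (1 - p) for t, p in zip(y_true, y_prob))
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    return 2 * precision * recall / (precision + recall + eps)

y_true = [1, 0, 0, 1]
confident = [0.9, 0.1, 0.2, 0.8]     # correct and far from 0.5
hesitant = [0.6, 0.4, 0.45, 0.55]    # also correct at 0.5, but barely

loss_confident = 1 - soft_f1(y_true, confident)
loss_hesitant = 1 - soft_f1(y_true, hesitant)
print(loss_confident, loss_hesitant)
```

Both toy predictions would score a perfect hard F1 at a 0.5 threshold, yet the soft loss still prefers the confident one, which gives the optimizer a gradient to follow.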

In [None]:
from pathlib import Path

tmp_path = Path("../tmp")
if not tmp_path.is_dir():
    os.mkdir(tmp_path)

filepath = "../tmp/weights_best.hdf5"

# Check print(model.metrics_names)
checkpoint = ModelCheckpoint(filepath, monitor='val_f1', verbose=1, save_best_only=True, mode='max')
#checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

In [None]:
#model = model_BidiGruLstm_Attention(embeddings_glove, embeddings_para, embeddings_wiki)
model = model_CNN_MaxPool2D(embeddings_glove, embeddings_para, embeddings_wiki)

print(model.summary())

In [None]:
#model.fit(train_X, train_y, validation_data=(val_X, val_y), epochs=N_EPOCHS, batch_size=BATCH_SIZE, verbose=1)

model_fitting_history = model.fit(train_X, train_y, validation_split=VALID_TRAIN_RATIO, epochs=N_EPOCHS, batch_size=BATCH_SIZE, callbacks=callbacks_list, verbose=1)

In [None]:
'''
pred_val_y = model.predict([val_X], batch_size=BATCH_SIZE, verbose=1)
for thresh in np.arange(0.1, 0.701, 0.01):
    thresh = np.round(thresh, 2)
    print("F1 score at threshold {0} is {1}".format(thresh, metrics.f1_score(val_y, (pred_val_y>thresh).astype(int))))
'''
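
The commented-out sweep above can be reduced to a small search for the best threshold. The sketch below does this in plain Python on toy data; `f1_at_threshold`, `best_threshold` and the toy labels and probabilities are assumptions for illustration, and in practice the search should run on held-out predictions.

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """Hard F1 of the positive class when predicting 1 for prob > threshold."""
    preds = [1 if p > threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)

def best_threshold(y_true, y_prob, candidates):
    # Returns the first candidate that reaches the highest F1.
    return max(candidates, key=lambda th: f1_at_threshold(y_true, y_prob, th))

toy_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
toy_prob = [0.05, 0.10, 0.15, 0.20, 0.30, 0.33, 0.40, 0.45, 0.60, 0.90]
candidates = [round(0.10 + 0.01 * i, 2) for i in range(61)]   # 0.10 .. 0.70

best = best_threshold(toy_true, toy_prob, candidates)
print(best, f1_at_threshold(toy_true, toy_prob, best))
```

On imbalanced data the best threshold often lies below 0.5, which is consistent with the `F1_threshold` value configured at the top of this notebook.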

In [None]:
plt.figure(1)
plt.subplot(311)
plt.plot(np.arange(1, N_EPOCHS+1), model_fitting_history.history['f1'])
plt.plot(np.arange(1, N_EPOCHS+1), model_fitting_history.history['val_f1'])
plt.title('model f1 -vs- val_f1')
plt.ylabel('F1 Score')
plt.xlabel('epoch')
plt.legend(['f1', 'val_f1'], loc='upper left')

plt.figure(2)
plt.subplot(312)
plt.plot(np.arange(1, N_EPOCHS+1), model_fitting_history.history['loss'])
plt.plot(np.arange(1, N_EPOCHS+1), model_fitting_history.history['val_loss'])
plt.title('model loss -vs- val_loss')
plt.ylabel('Loss')
plt.xlabel('epoch')
plt.legend(['loss', 'val_loss'], loc='upper left')

plt.figure(3)
plt.subplot(313)
plt.plot(np.arange(1, N_EPOCHS+1), model_fitting_history.history['acc'])
plt.plot(np.arange(1, N_EPOCHS+1), model_fitting_history.history['val_acc'])
plt.title('model acc -vs- val_acc')
plt.ylabel('Accuracy')
plt.xlabel('epoch')
plt.legend(['acc', 'val_acc'], loc='upper left')

plt.show()

In [None]:
# Recreate the original network model without above learned weights, so that best learned weights can be loaded into it for Inference, i.e., Prediction

#model_final = model_BidiGruLstm_Attention(embeddings_glove, embeddings_para, embeddings_wiki)
model_final = model_CNN_MaxPool2D(embeddings_glove, embeddings_para, embeddings_wiki)
model_final.load_weights("../tmp/weights_best.hdf5")

In [None]:
pred_test_y = model_final.predict([test_X], batch_size=BATCH_SIZE, verbose=1)

In [None]:
pred_test_y = (pred_test_y > F1_threshold).astype(int)
#pred_test_y = np.round(pred_test_y).astype(int)
out_df = pd.DataFrame({"qid":test_df["qid"].values})
out_df['prediction'] = pred_test_y
out_df.to_csv("submission.csv", index=False)

##### Reference

* Some of the pre-processing code has been borrowed from https://www.kaggle.com/kalyankkr/quora-questions-insincere-words 