# Text Classification from a data preparation perspective - based on the Quora competition

## Preface

The objective of this notebook is to show how to build text classification models using three text representations:    
* text represented by manually created features (manual feature engineering),  
* word-frequency-based text representation (matrix of token counts / TF-IDF features), 
* sequence of words (word embedding) representation.

On top of that, I would like to point out some tools that I find very useful:     
* [sklearn.pipeline](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.pipeline) - a module that implements utilities to build a composite estimator as a chain of transforms and estimators,    
* [sklearn.compose.ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) - a class that applies transformers to columns of an array or pandas DataFrame (recently added to scikit-learn),    
* [keras.wrappers.scikit_learn.KerasClassifier](https://keras.io/scikit-learn-api/) - a wrapper that allows Sequential Keras models to be used as part of a scikit-learn pipeline.   

Topics that aren't covered here:
* advanced methods of text cleaning and spell checking,   
* character-based language models, 
* ngram tokenizers,   
* word embeddings details,
* PyTorch

#### References:

1) [The Unreasonable Effectiveness of Recurrent Neural Networks by Andrej Karpathy](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)     
2) [Understanding LSTM Networks by Christopher Olah](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)   
3) [An Introduction to Bag-of-Words in NLP](https://medium.com/greyatom/an-introduction-to-bag-of-words-in-nlp-ac967d43b428)   
4) [Understanding word embeddings](https://hackernoon.com/understanding-word-embeddings-a9ff830403ce)     
5) [Useful properties of ROC curves, AUC scoring, and Gini Coefficients](https://luckytoilet.wordpress.com/2018/04/04/useful-properties-of-roc-curves-auc-scoring-and-gini-coefficients/)     
6) [Building a custom Python scikit-learn transformer for machine learning.](https://opendevincode.wordpress.com/2015/08/01/building-a-custom-python-scikit-learn-transformer-for-machine-learning/)    
7) [Getting started with the Keras Sequential model](https://keras.io/getting-started/sequential-model-guide/)    
8) [Wrappers for the Scikit-Learn API](https://keras.io/scikit-learn-api/)    
9) [GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/projects/glove/)    
10) [Empirical Evaluation of RNN Architectures on Sentence Classification Task](https://arxiv.org/pdf/1609.09171.pdf)    
11) [Getting started with the Keras functional API](https://keras.io/getting-started/functional-api-guide/)   
12) [Understanding Bidirectional RNN in PyTorch](https://towardsdatascience.com/understanding-bidirectional-rnn-in-pytorch-5bd25a5dd66)    
13) [Attention and Memory in Deep Learning and NLP](http://www.wildml.com/2016/01/attention-and-memory-in-deep-learning-and-nlp/)

## Problem overview - dataset used in this notebook

I decided to use the dataset from [Quora Insincere Questions Classification](https://www.kaggle.com/c/quora-insincere-questions-classification) because: 
* in my opinion its quality is quite good, 
* it consists of only two columns, text and target, so I don't need to deal with things that aren't important for the purpose of this notebook,
* I know this dataset quite well because I took part in this competition :) 

**In this competition we were predicting whether a question asked on Quora was sincere or not.**

[**Quora**](https://www.quora.com/) is a place to gain and share knowledge. It's a platform to ask questions and connect with people who contribute unique insights and quality answers.

An insincere question is defined as a question intended to make a statement rather than look for helpful answers.    
Some characteristics that can signify that a question is insincere:

* Has a non-neutral tone,
* Has an exaggerated/overemphasized tone to underscore a point about a group of people,
* Is rhetorical and meant to imply a statement about a group of people,
* Suggests a discriminatory idea against a protected class of people, or seeks confirmation of a stereotype,
* Makes disparaging attacks/insults against a specific person or group of people,
* Isn't grounded in reality,
* Based on false information, or contains absurd assumptions,
* Uses sexual content (incest, bestiality, pedophilia) for shock value, and not to seek genuine answers,

The training data includes the question that was asked and a target value indicating whether the text was identified as insincere (target = 1) or not (target = 0).     
The ground-truth labels contain some amount of noise: they are not guaranteed to be perfect.     
    
### Examples:

**insincere (target: 1)**   
* Why don't nannies want to work caring for black or half black children?
* What is it like being black/Muslim/homosexual/immigrant etc and support Trump?
* Do Muslims make animals watch other animals get slaughtered to trigger hormone release into their Halal meat?
* Why do Black women hate straight men?   
* Do you dislike me for being white?

**sincere  (target: 0)**   
* How can I create a fake UK student ID?
* What are some good negotiating practices? 
* How do I do good work in a new start up?
* Is it possible for a black hole to become a nova/supernova/hypernova?
* What will happen in each of the areas if the fortests are removed?

## Text representation summary

**In ML, the process of converting text into numbers is called vectorization.**

Let's have a look at a graph that shows three different ways of representing a text:

<img src="https://i.imgur.com/O1xE5hG.jpg" width="500" >

### 1. manual feature engineering    
a document/observation is described by a set of features created by hand, e.g.: 
* how long the document/observation is,   
* how many words the document/observation contains,   
* what the ratio of the number of words to the number of characters is,
* how many www links the text contains, 
* how many email addresses the text contains, 
* whether it contains a swear word,     
* etc.
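
A few of these features can be sketched in plain Python. The function name `manual_features` and the two regular expressions are assumptions made for this illustration; the patterns are deliberately simple and are not robust URL or email parsers.

```python
import re

# Deliberately simple patterns, for illustration only (assumed, not from the notebook).
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def manual_features(text):
    """Describe one document with a handful of hand-made features."""
    words = text.split()
    return {
        "n_chars": len(text),
        "n_words": len(words),
        # ratio of the number of words to the number of characters
        "words_per_char": len(words) / len(text) if text else 0.0,
        "n_links": len(URL_PATTERN.findall(text)),
        "n_emails": len(EMAIL_PATTERN.findall(text)),
    }

text = "Mail me at jane@example.com or visit www.example.com today"
features = manual_features(text)
print(features)
```

Each document becomes one fixed-length record, which is what a classical model such as logistic regression expects.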

### 2. frequency based representation
a document/observation is represented as the bag of its words, disregarding grammar and even word order but keeping multiplicity,        
just like in the manual feature engineering approach, one record/vector represents one observation.
* **matrix of token counts - CountVectorizer**     
CountVectorizer works on term frequency, i.e. it counts the occurrences of tokens and builds a sparse matrix of documents x tokens.

    ```
    matrix = [
       [0 1 1 1 0 0 1 0 1]
       [0 2 0 1 0 1 1 0 1]
       [1 0 0 1 1 0 1 1 1]
       [0 1 1 1 0 0 1 0 1]
    ]
    ```
* **a matrix of TF-IDF features -  TF-IDF Vectorizer**     
TF-IDF stands for term frequency-inverse document frequency. TF-IDF weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.

\begin{align}
\text{TF-IDF score} \:&=\: TF * IDF\\
TF \:&=\: \frac{\text{term frequency in document}}{\text{total words in document}}\\
IDF \:&=\: \log\left(\frac{\text{total number of documents}}{\text{documents with term}}\right)\\
 \end{align}
 
 
   <u>TF (Term Frequency)</u>: rewards words that occur often in a document [Frequent]    
   <img src="https://i.imgur.com/KYxoGEM.jpg" width="250" >
   <u>IDF (Inverse Document Frequency)</u>: penalizes words that appear in many documents of the collection. Overly general words, e.g. "or", "not", "is", "the", should not have a high weight [Reward Rarity].   
    <img src="https://i.imgur.com/YCZjqjP.jpg" width="250" >



```
   matrix =  [
        [0.,0.46979139, 0.58028582, 0.38408524, 0., 0., 0.38408524, 0., 0.38408524],
        [0., 0.6876236, 0., 0.28108867, 0., 0.53864762, 0.28108867, 0., 0.28108867],
        [0.51184851, 0., 0., 0.26710379, 0.51184851, 0., 0.26710379, 0.51184851, 0.26710379],
        [0., 0.46979139, 0.58028582, 0.38408524, 0., 0., 0.38408524, 0., 0.38408524]
     ]
```
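
The TF-IDF formulas above can be checked by hand with a few lines of plain Python. This is a minimal sketch on an assumed toy corpus of four short documents; the helper names `tf`, `idf` and `tf_idf` are illustrative. Note that scikit-learn's `TfidfVectorizer` by default uses raw counts as TF, a smoothed IDF (`ln((1 + n) / (1 + df)) + 1`) and L2 normalization of each row, so its numbers differ from this textbook version.

```python
import math

# Assumed toy corpus, already lowercased and stripped of punctuation.
corpus = [
    "this is the first document",
    "this document is the second document",
    "and this is the third one",
    "is this the first document",
]
docs = [doc.split() for doc in corpus]

def tf(term, doc):
    # term frequency in document / total words in document
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log(total number of documents / documents with term);
    # assumes the term occurs in at least one document
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

for term in ("the", "document", "second"):
    print(term, round(tf_idf(term, docs[1], docs), 4))
```

The word "the" appears in every document, so its IDF is `log(1) = 0` and it contributes nothing, while the rare word "second" gets the highest score in the second document.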

### 3. sequence based representation     
  
a document/observation is represented as a sequence of its words, where each word could be represented as:   
* word embedding (word2vec, GloVe, etc.) - a learned representation for text where words with similar meaning have a similar representation - [more in this article](https://hackernoon.com/understanding-word-embeddings-a9ff830403ce),
* one-hot vector (not recommended),     

More on this topic later...

 ## Load data

Load data and divide into:   
**Train set** - 80%, data to train the model, [1 mln]     
**Test set** - 10%, data to monitor model performance during training and for hyperparameter tuning, [130 K]     
**Validation set** - 10%, data to estimate the future performance of the model on new, so far unseen data, [130 K]

In [None]:
#set seed
seed = 1029

#import data
import pandas as pd
train = pd.read_csv('../input/train.csv')

#divide data into train (80%), test (10%) and valid (10%) sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(train[['question_text']], train[['target']], 
                                                    test_size=0.2, random_state=seed,
                                                    stratify=train['target'].tolist(),
                                                    shuffle = True)
X_test, X_valid, y_test, y_valid = train_test_split(X_test, y_test, test_size=0.5, random_state=seed,
                                                    stratify=y_test,shuffle = True)

Quick peek at the data:

In [None]:
train.head(5)

In [None]:
#clean up
import gc 
import time 

del(train)
gc.collect()
time.sleep(2)

In [None]:
import matplotlib.pyplot as plt #for visualization of data
from pylab import rcParams
import numpy as np
from collections import Counter
%matplotlib inline

cmap = plt.get_cmap("tab20c")
colors = cmap((np.arange(10)))

rcParams['figure.figsize'] = 20, 5
fig, ax = plt.subplots(1, 4, sharex='col', sharey='row')

ax[0].pie([X_train.shape[0],X_test.shape[0],X_valid.shape[0]], explode=(0, 0.1,0.1), 
          labels= ["train \n{}mln".format(round(X_train.shape[0]/1000000,2)), "test\n{}K".format(round(X_test.shape[0]/1000,1)),"valid\n{}K".format(round(X_valid.shape[0]/1000,1))], autopct='%1.0f%%',
        shadow=True, startangle=75, colors=colors)
ax[0].axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
ax[0].set_title('Data splits \n', fontsize=20)

# value_counts().sort_index() orders the counts as [target 0, target 1], so they always match the labels
ax[1].pie(y_train.target.value_counts().sort_index(), explode=(0,0.1), labels= ["sincere","insincere"], autopct='%1.0f%%',
        shadow=True, startangle=45, colors=colors)
ax[1].axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
ax[1].set_title('Train set \n', fontsize=20)

ax[2].pie(y_test.target.value_counts().sort_index(), explode=(0,0.1), labels= ["sincere","insincere"], autopct='%1.0f%%',
        shadow=True, startangle=45, colors=colors)
ax[2].axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
ax[2].set_title('Test set \n', fontsize=20)

ax[3].pie(y_valid.target.value_counts().sort_index(), explode=(0,0.1), labels= ["sincere","insincere"], autopct='%1.0f%%',
        shadow=True, startangle=45, colors=colors)
ax[3].axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
ax[3].set_title('Valid set \n', fontsize=20)
plt.show()

## Let's start building classifiers

## 1.  Random guessing 

Just assign a random score between **0** and **1** to each observation and check the metrics.   
We do this to get a reference point for how our evaluation metrics behave.    


   <img src="https://www.random.org/analysis/dilbert.jpg" width="500" >



In [None]:
import time
start = time.time()
import numpy as np
#y_pred = list(np.repeat(0.99,len(y_valid)))
np.random.seed(1029)
y_pred = list(np.random.uniform(size=len(y_valid)))
stop = time.time()
time_elapsed = stop-start

In [None]:
from sklearn.metrics import f1_score, accuracy_score

def bestThressholdF1(y_train_,train_preds_):
    # best F1 score over decision thresholds 0.10, 0.11, ..., 0.50
    best_score = 0
    for threshold in np.arange(0.1, 0.501, 0.01):
        score = f1_score(y_train_, np.array(train_preds_) > threshold)
        best_score = max(best_score, score)
    return best_score

def bestThressholdACC(y_train_,train_preds_):
    # best accuracy over decision thresholds 0.10, 0.11, ..., 0.50
    best_score = 0
    for threshold in np.arange(0.1, 0.501, 0.01):
        score = accuracy_score(y_train_, np.array(train_preds_) > threshold)
        best_score = max(best_score, score)
    return best_score

from sklearn import metrics

def get_scores(y_train__,train_preds__):
    dict_ = {}
    dict_['F1']=bestThressholdF1(y_train__,train_preds__)
    dict_['accuracy'] = bestThressholdACC(y_train__,train_preds__)
    fpr, tpr, thresholds = metrics.roc_curve(y_train__,train_preds__, pos_label=1)
    dict_['auc']=metrics.auc(fpr, tpr)
    dict_['gini'] = (metrics.auc(fpr, tpr)-0.5)*2
    return dict_

In [None]:
scores = []
scores.append(('Random guessing',get_scores(y_valid.target,y_pred), y_pred, 
               time.strftime("%Mmin %Ssec", time.gmtime(time_elapsed)),
              0))

In [None]:
scores[-1][1]

* **F1 score** - this metric was used to assess results in the competition; it is the harmonic mean of precision and recall. The F1 score reaches its best value at 1 (perfect precision and recall) and its worst at 0
 <img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/057ffc6b4fa80dc1c0e1f2f1f6b598c38cdd7c23" width="250" >

   <img src="https://upload.wikimedia.org/wikipedia/commons/2/26/Precisionrecall.svg" width="250" >

* **accuracy** - what is the percentage of the correct predictions, perfect score is 100%,     
* **AUC** (Area Under Curve) is the area enclosed by the ROC curve. A perfect classifier has AUC = 1 and a completely random classifier has AUC = 0.5,   
* **GINI**  - is 2*AUC – 1, and its purpose is to normalize the AUC so that a random classifier scores 0, and a perfect classifier scores 1. The range of possible Gini coefficient scores is [-1, 1].
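
To make these definitions concrete, here is a small worked example in plain Python. The ten labels and scores are made up for illustration, and the helpers `confusion_counts` and `auc_score` are hypothetical names; `auc_score` uses the pairwise interpretation of AUC (the probability that a random positive is scored above a random negative), which is fine for tiny inputs but too slow for real datasets.

```python
def confusion_counts(y_true, y_pred):
    """Return (true positives, false positives, false negatives, true negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def auc_score(y_true, y_score):
    """Share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and scores (3 insincere, 7 sincere), thresholded at 0.5.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_score = [0.1, 0.2, 0.3, 0.1, 0.6, 0.2, 0.9, 0.4, 0.7, 0.3]
y_pred = [int(s > 0.5) for s in y_score]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / len(y_true)
auc = auc_score(y_true, y_score)
gini = 2 * auc - 1

print(round(f1, 3), accuracy, round(auc, 3), round(gini, 3))
```

Here precision and recall are both 2/3, so F1 is 2/3 as well, while accuracy is 0.8: with imbalanced classes, accuracy looks better than the classifier really is on the rare class.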


In [None]:
import gc 
import time 

del(y_pred)
gc.collect()
time.sleep(2)

## 2.  Manual feature engineering - Ridge Logistic Regression

   <img src="https://i.imgur.com/RXqO9vv.jpg" width="400" >


Create some features from the text by hand:
* text length,
* number of words,
* density (number of words / text length),
* number of title-case words,
* number of capital letters,
* number of capital letters / text length,
* number of unique words,
* number of unique words / number of words,

We will be using some great [scikit-learn](https://scikit-learn.org/stable/index.html) features:    

* [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) - lets us build a data flow pipeline,
* [custom transformer](https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer.html#sphx-glr-auto-examples-compose-plot-column-transformer-py) - we need this to apply custom functions to the data,
* [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn.compose.ColumnTransformer) - we need this to operate on pandas DataFrame columns.    


Model: Ridge Logistic Regression

Define functions that calculate the features listed above:

In [None]:
def text_length(_list):
    return list(map(lambda x: len(x),_list))

def text_words(_list):
    return list(map(lambda x: len(x.split()),_list))

def text_density(_list):
    return np.array(text_words(_list))/np.array(text_length(_list))

def text_title_words(_list):
    return list(map(lambda x: sum([w.istitle() for w in x.split()]),_list))

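def text_capital_words(_list):
    # despite the name, this counts uppercase characters, not words
    return list(map(lambda x: sum(1 for c in x if c.isupper()),_list))

def text_caps_vs_length(_list):
    # share of uppercase characters in the text
    return list(map(lambda x: sum(1 for c in x if c.isupper())/len(x),_list))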

def text_unique(_list):
    return list(map(lambda x: len(set(w for w in x.split())),_list))

def text_words_vs_unique(_list):
    return list(map(lambda x: len(set(w for w in x.split()))/len(x.split()),_list))

from nltk.corpus import stopwords
stopWords = set(stopwords.words('english'))

def text_stopwords(_list):
    return list(map(lambda x: sum([1 if i in stopWords else 0 for i in x.split()]),_list))

Build a custom stateless scikit-learn transformer (nothing is learned in `fit`) that takes a function and applies it to the data:

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

class ApplyFunctionCT(BaseEstimator, TransformerMixin):
    
    def __init__(self, function, **kwargs):
        self.function = function
        self.kwargs = kwargs
        
    def fit(self, x, y = None):
        return self
    
    def get_feature_names(self):
        if hasattr(self, "columnNames"):
            return self.columnNames
        else:
            return None  
    
    def transform(self, x):
        if len(self.kwargs) == 0:
            wyn = x.apply(self.function)
        else:
            wyn = x.apply(self.function, param = self.kwargs)
        self.columnNames = wyn.columns
        return wyn

Define dataflow pipeline:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer,RobustScaler, MaxAbsScaler
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.pipeline import make_pipeline, make_union
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel
import category_encoders as ce

pipe = make_pipeline(
        make_pipeline(
            ColumnTransformer([
                ('text_length', ApplyFunctionCT(function=text_length),['question_text']),
                ('text_words', ApplyFunctionCT(function=text_words),['question_text']),
                ('text_density', ApplyFunctionCT(function=text_density),['question_text']),
                ('text_title_words', ApplyFunctionCT(function=text_title_words),['question_text']),
                ('text_capital_words', ApplyFunctionCT(function=text_capital_words),['question_text']),
                ('text_caps_vs_length', ApplyFunctionCT(function=text_caps_vs_length),['question_text']),
                ('text_unique', ApplyFunctionCT(function=text_unique),['question_text']),
                ('text_words_vs_unique', ApplyFunctionCT(function=text_words_vs_unique),['question_text']),
                ]),
            KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='quantile'),        
            ),
    LogisticRegression(penalty = 'l2', C = 0.2,  random_state=seed, solver = 'lbfgs',max_iter=400, 
                       verbose=1, n_jobs=-1) 
)

Train model:

In [None]:
import time
start = time.time()

pipe.fit(X_train, y_train.target)

stop = time.time()
time_elapsed = stop-start

Evaluate model:

In [None]:
import warnings
warnings.simplefilter("ignore")

y_predict = pipe.predict_proba(X_valid)[:,1]
scores.append(('Manual feature engineering',get_scores(y_valid,y_predict),y_predict, 
               time.strftime("%M:%S", time.gmtime(time_elapsed)),
              (pipe.named_steps['logisticregression'].coef_).size))

print(scores[-1][1])

Define a function that plots the progress:

In [None]:
import matplotlib.pyplot as plt #for visualization of data
%matplotlib inline

from pylab import rcParams
import numpy as np
from collections import Counter
import seaborn as sns

def plot_progress(list_):
    fig, ax = plt.subplots(figsize=(14, 6))
    #sns.set_style("whitegrid")

    plt.plot(list(range(len(list_))), [i[1]['gini'] for i in list_], '-ok')
    for j,i in enumerate(zip([i[1]['gini'] for i in list_],
                 list(range(len(list_))),
                 [i[0] for i in list_],
                 [i[3] for i in list_],
                 [i[4] for i in list_])):
        ax.text(i[1], i[0]-0.08, 
                i[2]+',\nG: '+str(round(i[0],3))+', t: '+i[3],rotation=-25, size=10, color = ['black','blue'][j%2])

    ax.patch.set_facecolor('white')
    ax.grid(color='grey', linestyle='-', linewidth=0.25, alpha=0.5)

    ax.spines['bottom'].set_color('0.5')
    ax.spines['top'].set_color(None)
    ax.spines['left'].set_color('0.5')
    ax.spines['right'].set_color(None)

    plt.title('Progress Tracker', fontsize=12)
    plt.xlabel('Models', fontsize=11)
    plt.ylabel('Gini', fontsize=11)
    plt.xticks(fontsize=9)
    plt.yticks(fontsize=9)
    pass


In [None]:
plot_progress(scores)

In [None]:
import gc 
import time 

del(pipe, y_predict)
gc.collect()
time.sleep(2)

## 3.  Frequency based embedding, TF-IDF Vectorizer - Ridge Logistic Regression

   <img src="https://i.imgur.com/r8e2yq1.jpg" width="400" >

Here we use [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) - a scikit-learn class that cleans and tokenizes the text and produces a ready-to-use sparse matrix of TF-IDF scores.   

Model: Ridge Logistic Regression

In [None]:
max_features_Vectorizer = None

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer,RobustScaler, MaxAbsScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

pipe = make_pipeline(
    ColumnTransformer([
        ('CV', TfidfVectorizer(lowercase=True, 
                               ngram_range=(1, 1), 
                               max_features=max_features_Vectorizer, 
                               dtype=np.float32,
                               use_idf=True),'question_text')]),
    LogisticRegression(penalty = 'l2', C = 2,  random_state=seed, solver = 'lbfgs',max_iter=400, 
                       verbose=1, n_jobs=-1)
    )

In [None]:
import time
start = time.time()

pipe.fit(X_train, y_train.target)

stop = time.time()
time_elapsed = stop-start

In [None]:
#x = pipe.named_steps['columntransformer'].transform(X_train)
#print("Sparsity equals {}- the number of zero-valued elements divided by the total number of elements".format(
#    ((x.shape[0]*x.shape[1]) -x.getnnz())/(x.shape[0]*x.shape[1])))

Sparsity (the number of zero-valued elements divided by the total number of elements) equals 0.9999332042305805.
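
To make the definition of sparsity concrete, it can be computed by hand for the small token-count matrix shown in the text representation summary. This is a plain-Python sketch with illustrative variable names; on a real scipy sparse matrix one would count the stored non-zero elements instead of looping.

```python
# Token-count matrix of 4 documents x 9 tokens.
count_matrix = [
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
    [0, 2, 0, 1, 0, 1, 1, 0, 1],
    [1, 0, 0, 1, 1, 0, 1, 1, 1],
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
]

n_elements = sum(len(row) for row in count_matrix)
n_zeros = sum(1 for row in count_matrix for value in row if value == 0)
sparsity = n_zeros / n_elements

print(n_zeros, n_elements, round(sparsity, 3))
```

With only nine tokens the matrix is fairly dense; with a vocabulary of hundreds of thousands of tokens and short questions, almost every cell is zero, which is why a sparse matrix format is used.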


In [None]:
y_predict = pipe.predict_proba(X_valid)[:,1]
scores.append(('TF-IDF Vectorizer, Ridge Regression',get_scores(y_valid,y_predict),y_predict,
               time.strftime("%M:%S", time.gmtime(time_elapsed)),
               (pipe.named_steps['logisticregression'].coef_).size))

In [None]:
scores[-1][1]

In [None]:
plot_progress(scores)

In [None]:
import gc 
import time 

del(pipe, y_predict)
gc.collect()
time.sleep(2)

## 4.  Frequency based embedding, MLP in Keras (The Sequential model API)

   <img src="https://i.imgur.com/4XcxCNh.jpg" width="400" >

Now, instead of Ridge Logistic Regression, we use a Multilayer Perceptron classifier, built on the Keras Sequential model with the Scikit-Learn wrapper.    
From here we start using GPU...

In [None]:
max_features_ = 10000

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Activation, InputLayer, BatchNormalization, Dropout
from keras.wrappers.scikit_learn import KerasClassifier
from keras.initializers import glorot_normal

# For custom metrics
import keras.backend as K

def create_model():
    model = Sequential([
        Dense(units=192,input_dim=max_features_,kernel_initializer=glorot_normal(seed=seed)),
        BatchNormalization(),
        Activation('relu'),
        Dropout(0.2,seed=seed),
        # input_dim is only needed on the first layer; later layers infer their input size
        Dense(units=64,kernel_initializer=glorot_normal(seed=seed)),
        BatchNormalization(),
        Activation('relu'),
        Dropout(0.2,seed=seed),
        Dense(units=32,kernel_initializer=glorot_normal(seed=seed)),
        BatchNormalization(),
        Activation('relu'),
        Dropout(0.2,seed=seed),
        Dense(1),
        Activation('sigmoid'),
    ])
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

Model summary:

In [None]:
create_model().summary()

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer,RobustScaler, MaxAbsScaler
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(
    ColumnTransformer([
        ('CV', TfidfVectorizer(lowercase=True, ngram_range=(1, 1), max_features=max_features_, dtype=np.float32,
                               use_idf=True),'question_text')]),
    KerasClassifier(build_fn=create_model, epochs=3, batch_size=512, verbose=1)
    )

In [None]:
import time
start = time.time()

pipe.fit(X_train, y_train.target)

stop = time.time()
time_elapsed = stop-start

In [None]:
y_predict = pipe.predict_proba(X_valid)[:,1]
scores.append(('TF-IDF Vectorizer, Keras MLP',get_scores(y_valid,y_predict),y_predict,
               time.strftime("%M:%S", time.gmtime(time_elapsed)),
               pipe.named_steps['kerasclassifier'].model.count_params()))

In [None]:
scores[-1][1]

In [None]:
plot_progress(scores)

## 5. Sequence Models

### Create sequences from text

Every type of recurrent neural network (RNN, LSTM, GRU) requires the text dataset to be transformed into a 3D tensor (number of texts x words per text x embedding size).      

   <img src="https://i.imgur.com/CknWXlS.jpg" width="400" >

Please notice that each word is a vector of numbers (from a pretrained embedding). We don't know exactly what a specific number means, but to build an intuition we could imagine that the features are e.g. gender, age, red, speed, scent.     
So, car and motorbike will have a similar "speed" feature, but for turtle this feature will be completely different; tomato and orange will have much closer values of the "red" feature than garlic, and so on. 
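
This intuition can be illustrated with invented numbers. In the sketch below the three-dimensional vectors and their imagined axes (speed, red, scent) are made up for the example; real GloVe vectors have 300 dimensions with no such readable meaning. Cosine similarity is used here as one common way to compare two word vectors.

```python
import math

# Invented 3-dimensional "embeddings"; axes imagined as (speed, red, scent).
toy_embeddings = {
    "car":       [0.9, 0.3, 0.1],
    "motorbike": [0.8, 0.4, 0.1],
    "turtle":    [0.05, 0.1, 0.2],
    "tomato":    [0.0, 0.9, 0.4],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_car_motorbike = cosine_similarity(toy_embeddings["car"], toy_embeddings["motorbike"])
sim_car_turtle = cosine_similarity(toy_embeddings["car"], toy_embeddings["turtle"])

print(round(sim_car_motorbike, 2), round(sim_car_turtle, 2))
```

Words that share feature values end up pointing in nearly the same direction, so their similarity is close to 1.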

### To prepare text in the form of a 3D tensor we need to apply the following steps:

Every type of recurrent neural network (RNN) requires fixed-length input (length of text).    
When we start preparing data for an RNN, we need to analyze the data and choose the max text length.    
It's also good to limit the number of unique words - this affects computation time.

In [None]:
#settings 
maxlen = 70 # max number of words in a question to use
max_features = 120000 # how many unique words to use (i.e. number of rows in the embedding matrix)


Next we need to apply text cleaning and word tokenization.    
I recommend using the Keras Tokenizer, which does this in a fast and simple way - I've tried many other approaches, but the Keras Tokenizer is the best so far.

In [None]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
## Tokenize the sentences
tokenizer = Tokenizer(num_words=max_features,
                      filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', 
                      lower=True, 
                      split=' ', 
                      char_level=False, 
                      oov_token=None, 
                      document_count=0)

Here are the main parameters of ```keras.preprocessing.text.Tokenizer```:
* **num_words**: the maximum number of words to keep, based on word frequency. Only the most common num_words words will be kept.      
* **filters**: a string where each element is a character that will be filtered from the texts. The default is all punctuation, plus tabs and line breaks, minus the ' character. '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'       
* **lower**: boolean. Whether to convert the texts to lowercase.       
* **split**: str. Separator for word splitting.       
* **char_level**: if True, every character will be treated as a token.     
* **oov_token**: if given, it will be added to word_index and used to replace out-of-vocabulary words during texts_to_sequences calls.

In [None]:
tokenizer.fit_on_texts(X_train.question_text.tolist())

To represent text in a sequential way and use it to train a model, we need 3 elements:   
* **text represented as a list of numbers**, where each number represents a specific word, 
* **word_index dictionary** - a dictionary that maps each word to its number,   
* **word embedding** - a numeric word representation: either a dense word embedding (e.g. GloVe) or a one-hot encoding.
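
The first two elements can be sketched in plain Python. This is a rough approximation of what the tokenizer does, not Keras' actual implementation; the helper names `to_words`, `build_word_index` and `to_sequences` and the two example texts are assumptions made for illustration.

```python
from collections import Counter

FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def to_words(text):
    """Lowercase, replace filtered characters with spaces, split on spaces."""
    table = str.maketrans(FILTERS, " " * len(FILTERS))
    return [w for w in text.lower().translate(table).split(" ") if w]

def build_word_index(texts):
    """Most frequent word gets index 1; index 0 stays free for padding."""
    counts = Counter(w for text in texts for w in to_words(text))
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {word: i for i, word in enumerate(ranked, start=1)}

def to_sequences(texts, word_index, num_words=None):
    """Map words to indices; unknown words and words ranked num_words or lower are dropped."""
    return [
        [word_index[w] for w in to_words(text)
         if w in word_index and (num_words is None or word_index[w] < num_words)]
        for text in texts
    ]

texts = ["Is the sky blue?", "The sky is not green, is it?"]
word_index = build_word_index(texts)
sequences = to_sequences(texts, word_index)

print(word_index)
print(sequences)
```

Limiting the vocabulary with `num_words` simply removes the rare words from each sequence, so the sequences get shorter rather than receiving a placeholder (unless an out-of-vocabulary token is configured).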


The Keras Tokenizer can prepare the word index for you.

In [None]:
word_index = tokenizer.word_index

In [None]:
im = 10
a = {}
for j,i in enumerate(tokenizer.word_index):
    a[i]=tokenizer.word_index[i]
    if j>=im:
        break
a

The Keras Tokenizer can also translate a text into a list of numbers, mapped through word_index:

In [None]:
X_train_seq = tokenizer.texts_to_sequences(X_train.question_text.tolist())
X_test_seq = tokenizer.texts_to_sequences(X_test.question_text.tolist())
X_valid_seq = tokenizer.texts_to_sequences(X_valid.question_text.tolist())

In [None]:
print(X_train.question_text.tolist()[0])
print(X_train_seq[0])
print('------------------------------------------------------------------------')
print(X_train.question_text.tolist()[100])
print(X_train_seq[100])
print('------------------------------------------------------------------------')
print(X_train.question_text.tolist()[202])
print(X_train_seq[202])

To ensure that each text has a fixed size we need to apply sequence padding. We can use the Keras pad_sequences function.

In [None]:
X_train_seq = pad_sequences(X_train_seq, maxlen=maxlen)
X_test_seq = pad_sequences(X_test_seq, maxlen=maxlen)
X_valid_seq = pad_sequences(X_valid_seq, maxlen=maxlen)
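
What padding does to a single sequence can be sketched in plain Python. The helper `pad_sequence` is a hypothetical name; it mimics the default behaviour of Keras `pad_sequences` (zeros added at the front, and sequences that are too long truncated from the front) for one list at a time.

```python
def pad_sequence(seq, maxlen, value=0):
    """Left-pad with `value`; if too long, keep only the last `maxlen` items."""
    seq = seq[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

short = pad_sequence([5, 8, 2], maxlen=5)
long = pad_sequence([1, 2, 3, 4, 5, 6, 7], maxlen=5)

print(short)  # [0, 0, 5, 8, 2]
print(long)   # [3, 4, 5, 6, 7]
```

Index 0 never appears in word_index, so the model can treat it as "no word here".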

In [None]:
print(X_train.question_text.tolist()[0])
print(X_train_seq[0])
print('------------------------------------------------------------------------')
print(X_train.question_text.tolist()[100])
print(X_train_seq[100])
print('------------------------------------------------------------------------')
print(X_train.question_text.tolist()[202])
print(X_train_seq[202])

In [None]:
import gc 
import time 

del(tokenizer)
gc.collect()
time.sleep(2)

### Load GloVe: pretrained word embeddings / pretrained word vectors

Step by step:
1. Load the file with pretrained word vectors, 
2. Select only the words that are in the word_index dictionary,   
3. If there is no pretrained word vector for a word from word_index, we give it a random vector drawn from a normal distribution with the embeddings' mean and standard deviation,
4. As a result we get a 2D array (one row per word index, one column per embedding feature).
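
These steps can be tried on toy data before running them on the full GloVe file. In this sketch the two 4-dimensional vectors, the three-word `toy_word_index` and the word "zzyzx" are invented, and the matrix gets `len(toy_word_index) + 1` rows so that every index has a row (an assumption of this example, not necessarily the sizing used in the notebook).

```python
import numpy as np

# Toy stand-ins for the pretrained vectors and the tokenizer's word index.
toy_embeddings_index = {
    "sky":  np.array([0.1, 0.2, 0.3, 0.4], dtype="float32"),
    "blue": np.array([0.5, 0.6, 0.7, 0.8], dtype="float32"),
}
toy_word_index = {"sky": 1, "blue": 2, "zzyzx": 3}  # "zzyzx" has no pretrained vector

embed_size = 4
nb_words = len(toy_word_index) + 1  # +1 because index 0 is reserved for padding

vectors = np.stack(list(toy_embeddings_index.values()))
rng = np.random.RandomState(1029)
# Start from random rows so that words without a pretrained vector still get values.
embedding_matrix = rng.normal(vectors.mean(), vectors.std(), (nb_words, embed_size))

for word, i in toy_word_index.items():
    vector = toy_embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector  # overwrite the random row with the pretrained one

print(embedding_matrix.shape)  # (4, 4)
```

Row `i` of the resulting 2D array is the vector of the word whose index is `i`, which is exactly the layout an embedding layer expects as initial weights.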

In [None]:
def load_glove(word_index, max_features__ = max_features):
    EMBEDDING_FILE = '../input/embeddings/glove.840B.300d/glove.840B.300d.txt'
    def get_coefs(word,*arr): return word, np.asarray(arr, dtype='float32')[:300]
    # read the file once: each line holds a word followed by its vector
    with open(EMBEDDING_FILE, encoding='utf8') as f:
        embeddings_index = dict(get_coefs(*o.split(" ")) for o in f)
    
    # hard-coded mean and standard deviation used to initialise missing vectors
    emb_mean,emb_std = -0.005838499,0.48782197
    embed_size = len(next(iter(embeddings_index.values())))

    nb_words = min(max_features__, len(word_index))
    np.random.seed(1029)
    # start from random vectors, so words without a pretrained vector keep a random one
    embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
    for word, i in word_index.items():
        if i >= nb_words: continue # skip indices that do not fit in the matrix
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None: embedding_matrix[i] = embedding_vector
            
    return embedding_matrix 

In [None]:
glove_embeddings = load_glove(word_index)

Combining all the pieces (text represented as a list of numbers, word_index, the word embedding and one additional layer, Keras Embedding), we produce a 3D tensor.     
 
 Let's have a look at the embedding for the sentence `the quick fast brown fox jumps over the lazy slow dog` (30 of 300 vector components).    

Can you find word features that have similar values for the words `quick` and `fast`?
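
The lookup performed by an embedding layer can be imitated with NumPy indexing. This is a minimal sketch with assumed toy sizes (6 word indices, 3 features per word, 2 texts padded to 4 words); the names `toy_embedding_matrix` and `padded` are illustrative.

```python
import numpy as np

rng = np.random.RandomState(0)
toy_embedding_matrix = rng.normal(size=(6, 3))  # 6 word indices, 3 features per word
padded = np.array([[0, 0, 3, 1],
                   [0, 5, 2, 4]])               # 2 texts, each padded to 4 word indices

# Indexing with an integer matrix picks one embedding row per word index.
tensor = toy_embedding_matrix[padded]

print(tensor.shape)  # (2, 4, 3)
```

The result has one axis for texts, one for word positions and one for embedding features, which is the 3D input a recurrent layer consumes.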

In [None]:
import seaborn as sns; sns.set()
sz = 30
plt.figure(figsize=(9,9))
plt.title('Embedding for sentence `the quick fast brown fox jumps over the lazy slow dog` ({} of 300 features) \n'.format(str(sz)))
ax = sns.heatmap(pd.DataFrame(np.hstack([glove_embeddings[word_index['the'],:].reshape(-1,1),
           glove_embeddings[word_index['quick'],:].reshape(-1,1),
           glove_embeddings[word_index['fast'],:].reshape(-1,1),
           glove_embeddings[word_index['brown'],:].reshape(-1,1),
           glove_embeddings[word_index['fox'],:].reshape(-1,1),
           glove_embeddings[word_index['jumps'],:].reshape(-1,1),
           glove_embeddings[word_index['over'],:].reshape(-1,1),
           glove_embeddings[word_index['the'],:].reshape(-1,1),
           glove_embeddings[word_index['lazy'],:].reshape(-1,1),
           glove_embeddings[word_index['slow'],:].reshape(-1,1),
           glove_embeddings[word_index['dog'],:].reshape(-1,1)  
          ])[:sz,:],columns = ['the','quick','fast','brown','fox','jumps','over','the','lazy','slow','dog']
                   ,index = ['word feature {}'.format(str(i)) for i in range(sz)]
                             ), 
                 cbar=False,annot=True,annot_kws={"size": 10})
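
Looking for similar values by eye in the heatmap can be complemented with a single number. A common choice is cosine similarity, sketched below. The vectors in `toy_vectors` are invented for illustration and are NOT real GloVe values.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    a = np.asarray(a, dtype="float64")
    b = np.asarray(b, dtype="float64")
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 4-dimensional vectors, only to show the computation.
toy_vectors = {
    "quick": [0.9, 0.1, 0.8, 0.0],
    "fast":  [0.8, 0.2, 0.9, 0.1],
    "dog":   [0.0, 0.9, 0.1, 0.8],
}

sim_quick_fast = cosine_similarity(toy_vectors["quick"], toy_vectors["fast"])
sim_quick_dog = cosine_similarity(toy_vectors["quick"], toy_vectors["dog"])

print(sim_quick_fast > sim_quick_dog)  # True for these toy vectors
```

On the real data the same function could be called as `cosine_similarity(glove_embeddings[word_index['quick']], glove_embeddings[word_index['fast']])`, assuming both words are within the vocabulary limit.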

## 5.1 Simple RNN, Many to One (Keras Functional API)


   <img src="https://i.imgur.com/lUVc0QT.jpg" width="500" >


There is a lot to say about Recurrent Neural Networks, but to make a long story short:     
Think of an RNN as many single neural networks (NN), as many as the fixed text length (in this case 70).  
Each of these single NNs takes one vector that represents a word, computes an output from it and passes that output to the next NN, which combines the output of the previous NN with its own input word and computes a new output. Then it passes that on to the next NN, and so on.    
The one thing that you need to know is that all of these NNs share the same weights; you will find more about it at the end of this notebook.
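
The chain of small networks with shared weights can be sketched in a few lines of NumPy. This is a simplified illustration with toy sizes and random weights, not the Keras implementation; the function name `simple_rnn_last_state` is made up for this sketch.

```python
import numpy as np

def simple_rnn_last_state(inputs, W_x, W_h, b):
    """Run a tanh RNN over `inputs` (steps, features) and return the last hidden state."""
    h = np.zeros(W_h.shape[0])
    for x_t in inputs:                       # one step per word
        # The same W_x, W_h and b are reused at every time step.
        h = np.tanh(x_t @ W_x + h @ W_h + b)
    return h

rng = np.random.default_rng(0)
steps, features, units = 5, 3, 4             # toy sizes; the notebook uses 70, 300, 64
W_x = rng.normal(size=(features, units))
W_h = rng.normal(size=(units, units))
b = np.zeros(units)

sentence = rng.normal(size=(steps, features))
last_h = simple_rnn_last_state(sentence, W_x, W_h, b)

print(last_h.shape)  # (4,): one vector for the whole sentence (many to one)
```

Returning only the last hidden state corresponds to `return_sequences=False` in the Keras layer.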

In [None]:
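#nb_words = len(word_index)+1
nb_words = glove_embeddings.shape[0]
WV_DIM = glove_embeddings.shape[1]
# maxlen (the padded sequence length) is defined earlier in the notebook
SEED = 1029
seed = SEED  # the model cells below use both spellings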

In [None]:
from keras import Input
from keras.layers import Embedding, SpatialDropout1D, CuDNNLSTM, CuDNNGRU, Dropout, Dense, SimpleRNN
from keras.layers import concatenate, Bidirectional, GlobalAveragePooling1D, GlobalMaxPooling1D
from keras.initializers import glorot_normal, orthogonal
# First layer
# create input
main_input = Input(shape=(maxlen,), dtype='int32',name='main_input')
# creating the embedding
embedded_sequences = Embedding(input_dim = nb_words,
                               output_dim = WV_DIM,
                               mask_zero=False,
                               weights=[glove_embeddings],
                               input_length=maxlen,
                               trainable=False)(main_input)
#Second layer
embedded_sequences = SpatialDropout1D(0.2, seed=seed)(embedded_sequences)
x = SimpleRNN(64, return_sequences=False,
                            kernel_initializer=glorot_normal(seed=seed),
                            recurrent_initializer=orthogonal(seed=seed))(embedded_sequences)

# x has shape (batch, 64): only the output of the last time step is returned
# The RNN input is three-dimensional: (sample size, number of time steps, input dimension)
preds = Dense(16, activation="relu", kernel_initializer=glorot_normal(seed=seed))(x)
preds = Dropout(0.1,seed=seed)(preds)
preds = Dense(1, activation="sigmoid", kernel_initializer=glorot_normal(seed=seed))(preds)


In [None]:
from keras.models import Model
from keras.optimizers import Adam
model = Model(inputs=main_input, outputs=preds)
model.compile(loss='binary_crossentropy', optimizer=Adam(clipvalue=1), metrics=['accuracy'])

In [None]:
model.summary()

In [None]:
import time
start = time.time()

hist = model.fit(X_train_seq, y_train, batch_size=1024, epochs=5, 
                 validation_data=(X_test_seq, y_test))

stop = time.time()
time_elapsed = stop-start

In [None]:
pred_val = model.predict(X_valid_seq, batch_size=1024, verbose=1)

In [None]:
scores.append(('RNN, Many to One',get_scores(y_valid.target.tolist(),list(pred_val[:,0])),list(pred_val[:,0]),
              time.strftime("%M:%S", time.gmtime(time_elapsed)),
              model.count_params()
              ))

In [None]:
scores[-1][1]

In [None]:
plot_progress(scores)

In [None]:
import gc 
import time 

del(model, hist, pred_val)
gc.collect()
time.sleep(2)


## 5.2 LSTM, Many to One

<img src="https://i.imgur.com/Qx8W4Fo.jpg" width="500" >


Now we use an LSTM instead of a simple RNN. An LSTM mitigates the vanishing gradient problem with a gated memory cell that helps it "remember" information from earlier time steps.    
More about the differences at the end of the notebook.
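
To make the "gated memory cell" more concrete, here is a rough NumPy sketch of a single LSTM time step. It follows the standard LSTM equations; the gate order inside the weight matrices is an arbitrary choice for this sketch, the weights are random, and CuDNNLSTM's internal layout may differ.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W: (features, 4*units), U: (units, 4*units), b: (4*units,)."""
    units = h_prev.shape[0]
    z = x_t @ W + h_prev @ U + b
    i = sigmoid(z[0 * units:1 * units])      # input gate: how much new information to write
    f = sigmoid(z[1 * units:2 * units])      # forget gate: how much of the old cell state to keep
    g = np.tanh(z[2 * units:3 * units])      # candidate values for the cell state
    o = sigmoid(z[3 * units:4 * units])      # output gate: how much of the cell state to expose
    c = f * c_prev + i * g                   # additive update of the memory cell
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
features, units = 3, 2                       # toy sizes; the notebook uses 300 and 64
W = rng.normal(size=(features, 4 * units))
U = rng.normal(size=(units, 4 * units))
b = np.zeros(4 * units)

h, c = np.zeros(units), np.zeros(units)
for x_t in rng.normal(size=(5, features)):   # a "sentence" of 5 word vectors
    h, c = lstm_step(x_t, h, c, W, U, b)

print(h.shape, c.shape)  # (2,) (2,)
```

When the forget gate is close to 1 and the input gate close to 0, the cell state `c` is carried over almost unchanged, which is what lets information survive many time steps.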

In [None]:
from keras import Input
from keras.layers import Embedding, SpatialDropout1D, CuDNNLSTM, CuDNNGRU, Dropout, Dense, SimpleRNN
from keras.layers import concatenate, Bidirectional, GlobalAveragePooling1D, GlobalMaxPooling1D
from keras.initializers import glorot_normal, orthogonal
# First layer
# create input
main_input = Input(shape=(maxlen,), dtype='int32',name='main_input')
# creating the embedding
embedded_sequences = Embedding(input_dim = nb_words,
                               output_dim = WV_DIM,
                               mask_zero=False,
                               weights=[glove_embeddings],
                               input_length=maxlen,
                               trainable=False)(main_input)
#Second layer
embedded_sequences = SpatialDropout1D(0.2, seed=SEED)(embedded_sequences)
x = CuDNNLSTM(64, return_sequences=False,
                            kernel_initializer=glorot_normal(seed=seed),
                            recurrent_initializer=orthogonal(seed=seed))(embedded_sequences)

# x has shape (batch, 64): only the output of the last time step is returned
# The RNN input is three-dimensional: (sample size, number of time steps, input dimension)
preds = Dense(16, activation="relu", kernel_initializer=glorot_normal(seed=seed))(x)
preds = Dropout(0.1,seed=seed)(preds)
preds = Dense(1, activation="sigmoid", kernel_initializer=glorot_normal(seed=seed))(preds)


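We use `clipvalue` in the optimizer: `Adam(clipvalue=1)` clips every gradient component to the range [-1, 1] (see the note on gradient clipping at the end of the notebook).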

In [None]:
from keras.models import Model
from keras.optimizers import Adam
model = Model(inputs=main_input, outputs=preds)
model.compile(loss='binary_crossentropy', optimizer=Adam(clipvalue=1), metrics=['accuracy'])

In [None]:
import time
start = time.time()

hist = model.fit(X_train_seq, y_train, batch_size=1024, epochs=5, 
                 validation_data=(X_test_seq, y_test))

stop = time.time()
time_elapsed = stop-start

In [None]:
pred_val = model.predict(X_valid_seq, batch_size=1024, verbose=1)

In [None]:
scores.append(('LSTM, Many to One',get_scores(y_valid.target.tolist(),list(pred_val[:,0])),list(pred_val[:,0]),
              time.strftime("%M:%S", time.gmtime(time_elapsed)),
              model.count_params()
              ))

In [None]:
scores[-1][1]

In [None]:
plot_progress(scores)

In [None]:
import gc 
import time 

del(model, hist, pred_val)
gc.collect()
time.sleep(2)

## 5.3 Bidirectional LSTM, Many to One



<img src="https://i.imgur.com/o2NRLPM.jpg" width="500" >



We now have a model similar to the last one, but instead of a single LSTM we use a Bidirectional LSTM. 

"Bidirectional recurrent neural networks(RNN) are really just putting two independent RNNs together. The input sequence is fed in normal time order for one network, and in reverse time order for another. The outputs of the two networks are usually concatenated at each time step, though there are other options, e.g. summation."

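The idea in the quote can be sketched with two independent toy RNNs, one reading the sentence in normal order and one in reverse. For brevity the sketch uses a plain tanh RNN instead of an LSTM; all sizes, weights and the helper name `rnn_last_state` are invented for illustration.

```python
import numpy as np

def rnn_last_state(inputs, W_x, W_h):
    """Last hidden state of a tanh RNN run over `inputs` (steps, features)."""
    h = np.zeros(W_h.shape[0])
    for x_t in inputs:
        h = np.tanh(x_t @ W_x + h @ W_h)
    return h

rng = np.random.default_rng(1)
steps, features, units = 6, 3, 4
sentence = rng.normal(size=(steps, features))

# Two independent sets of weights: one per direction.
fw = (rng.normal(size=(features, units)), rng.normal(size=(units, units)))
bw = (rng.normal(size=(features, units)), rng.normal(size=(units, units)))

h_forward = rnn_last_state(sentence, *fw)          # reads word 1 ... word 6
h_backward = rnn_last_state(sentence[::-1], *bw)   # reads word 6 ... word 1

bi_output = np.concatenate([h_forward, h_backward])
print(bi_output.shape)  # (8,): twice the number of units
```

This is why a bidirectional layer with 64 units produces 128 output features.
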
In [None]:
from keras import Input
from keras.layers import Embedding, SpatialDropout1D, CuDNNLSTM, CuDNNGRU, Dropout, Dense, SimpleRNN
from keras.layers import concatenate, Bidirectional, GlobalAveragePooling1D, GlobalMaxPooling1D
from keras.initializers import glorot_normal, orthogonal
# First layer
# create input
main_input = Input(shape=(maxlen,), dtype='int32',name='main_input')
# creating the embedding
embedded_sequences = Embedding(input_dim = nb_words,
                               output_dim = WV_DIM,
                               mask_zero=False,
                               weights=[glove_embeddings],
                               input_length=maxlen,
                               trainable=False)(main_input)
#Second layer
embedded_sequences = SpatialDropout1D(0.2, seed=SEED)(embedded_sequences)
x = Bidirectional(CuDNNLSTM(64, return_sequences=False,
                  return_state = False,
                  kernel_initializer=glorot_normal(seed=seed),
                  recurrent_initializer=orthogonal(seed=seed)))(embedded_sequences)

# x has shape (batch, 128): the final outputs of the forward and backward LSTM, concatenated
# The RNN input is three-dimensional: (sample size, number of time steps, input dimension)
preds = Dense(16, activation="relu", kernel_initializer=glorot_normal(seed=seed))(x)
preds = Dropout(0.1,seed=seed)(preds)
preds = Dense(1, activation="sigmoid", kernel_initializer=glorot_normal(seed=seed))(preds)


In [None]:
from keras.models import Model
from keras.optimizers import Adam
model = Model(inputs=main_input, outputs=preds)
model.compile(loss='binary_crossentropy', optimizer=Adam(clipvalue=1), metrics=['accuracy'])

In [None]:
import time
start = time.time()

hist = model.fit(X_train_seq, y_train, batch_size=1024, epochs=5, 
                 validation_data=(X_test_seq, y_test))

stop = time.time()
time_elapsed = stop-start

In [None]:
pred_val = model.predict(X_valid_seq, batch_size=1024, verbose=1)

In [None]:
scores.append(('BiLSTM, Many to One',get_scores(y_valid.target.tolist(),list(pred_val[:,0])),
               list(pred_val[:,0]),
               time.strftime("%M:%S", time.gmtime(time_elapsed)),
               model.count_params()
              ))

In [None]:
scores[-1][1]

In [None]:
plot_progress(scores)

In [None]:
import gc 
import time 

del(model, hist, pred_val)
gc.collect()
time.sleep(2)


## 5.4 Bidirectional LSTM, Many to Many + Pooling


<img src="https://i.imgur.com/URhvXb7.jpg" width="500" >



Instead of using only the last LSTM output, we can use the outputs from all the time steps to predict the score.  
A widely used RNN structure for this is the “pooling model”.    
In a pooling BiLSTM model, every time step of the LSTM can be seen as voting for the feature layer. “Max pooling” is a form of voting that always chooses the highest value. “Mean pooling” calculates the value at the i-th position of the vector h by averaging the corresponding values of the hidden state vectors from all BiLSTM time steps.
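
Both kinds of pooling reduce the time axis and keep one value per feature. A small NumPy sketch for a single sentence, with invented hidden state values, shows the idea:

```python
import numpy as np

# Toy BiLSTM output for ONE sentence: 4 time steps, 3 features per step.
hidden_states = np.array([[0.25, 0.50, -0.50],
                          [0.75, 0.00,  0.00],
                          [0.50, 0.25,  0.50],
                          [0.50, 0.25,  0.00]])

# Max pooling: highest value of each feature over all time steps (0.75, 0.5, 0.5).
max_pooled = hidden_states.max(axis=0)

# Mean pooling: average of each feature over all time steps (0.5, 0.25, 0.0).
mean_pooled = hidden_states.mean(axis=0)

print(max_pooled.shape, mean_pooled.shape)  # (3,) (3,)
```

`GlobalMaxPooling1D` and `GlobalAveragePooling1D` apply the same reduction to every sentence in the batch.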

In [None]:
from keras import Input
from keras.layers import Embedding, SpatialDropout1D, CuDNNLSTM, CuDNNGRU, Dropout, Dense, SimpleRNN
from keras.layers import concatenate, Bidirectional, GlobalAveragePooling1D, GlobalMaxPooling1D
from keras.initializers import glorot_normal, orthogonal
# First layer
# create input
main_input = Input(shape=(maxlen,), dtype='int32',name='main_input')
# creating the embedding
embedded_sequences = Embedding(input_dim = nb_words,
                               output_dim = WV_DIM,
                               mask_zero=False,
                               weights=[glove_embeddings],
                               input_length=maxlen,
                               trainable=False)(main_input)
#Second layer
embedded_sequences = SpatialDropout1D(0.2, seed=SEED)(embedded_sequences)
x, forward_h, forward_c, backward_h, backward_c = Bidirectional(CuDNNLSTM(64, return_sequences=True,return_state = True,
                  kernel_initializer=glorot_normal(seed=seed),
                  recurrent_initializer=orthogonal(seed=seed)))(embedded_sequences)
state_h = concatenate([forward_h, backward_h])
state_c = concatenate([forward_c, backward_c])

#x (?, 70, 128)
#forward_h (?, 64)
#forward_c (?, 64)
#backward_h (?, 64)
#backward_c (?, 64)

avg_pool = GlobalAveragePooling1D()(x)
#avg_pool = (?, 128)
max_pool = GlobalMaxPooling1D()(x)
#max_pool = (?, 128)

conc = concatenate([state_h, avg_pool, max_pool])
#conc (?, 384)

# conc is 2D, (batch, features), which is what the Dense layers below expect
preds = Dense(16, activation="relu", kernel_initializer=glorot_normal(seed=seed))(conc)
preds = Dropout(0.1,seed=seed)(preds)
preds = Dense(1, activation="sigmoid", kernel_initializer=glorot_normal(seed=seed))(preds)


In [None]:
from keras.models import Model
from keras.optimizers import Adam
model = Model(inputs=main_input, outputs=preds)
model.compile(loss='binary_crossentropy', optimizer=Adam(clipvalue=1), metrics=['accuracy'])

In [None]:
import time
start = time.time()

hist = model.fit(X_train_seq, y_train, batch_size=1024, epochs=5, 
                 validation_data=(X_test_seq, y_test))

stop = time.time()
time_elapsed = stop-start

In [None]:
pred_val = model.predict(X_valid_seq, batch_size=1024, verbose=1)

In [None]:
scores.append(('BiLSTM, Many to Many, Max Pooling',get_scores(y_valid.target.tolist(),list(pred_val[:,0])),
               list(pred_val[:,0]),
               time.strftime("%M:%S", time.gmtime(time_elapsed)),
               model.count_params()
              ))

In [None]:
scores[-1][1]

In [None]:
plot_progress(scores)

In [None]:
import gc 
import time 

del(model, hist, pred_val)
gc.collect()
time.sleep(2)

#get_scores(y_valid,np.average(np.vstack([i[2] for i in scores[2:]]),axis=0).reshape(-1))

## 5.4 Bidirectional LSTM + Bidirectional GRU, Many to Many + Pooling

<img src="https://i.imgur.com/4TXRmeF.jpg" width="500" >

RNN layers can be stacked just like the layers of a deep neural network; in this case we stack 2 RNN layers (LSTM and GRU). It is rather rare to stack more than 4 RNN layers. 
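
Stacking only works if the lower layer hands its output for every time step to the layer above, which is why the first recurrent layer uses `return_sequences=True`. A toy NumPy sketch with plain tanh RNNs (made-up sizes, random weights, hypothetical helper `rnn_all_states`):

```python
import numpy as np

def rnn_all_states(inputs, W_x, W_h):
    """Return the hidden state at every time step: shape (steps, units)."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(x_t @ W_x + h @ W_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(2)
steps, features, units_1, units_2 = 5, 3, 4, 2
sentence = rng.normal(size=(steps, features))

# Layer 1 reads the word vectors; layer 2 reads the hidden states of layer 1.
layer_1 = rnn_all_states(sentence,
                         rng.normal(size=(features, units_1)),
                         rng.normal(size=(units_1, units_1)))
layer_2 = rnn_all_states(layer_1,
                         rng.normal(size=(units_1, units_2)),
                         rng.normal(size=(units_2, units_2)))

print(layer_1.shape, layer_2.shape)  # (5, 4) (5, 2)
```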

In [None]:
from keras import Input
from keras.layers import Embedding, SpatialDropout1D, CuDNNLSTM, CuDNNGRU, Dropout, Dense, SimpleRNN
from keras.layers import concatenate, Bidirectional, GlobalAveragePooling1D, GlobalMaxPooling1D
from keras.initializers import glorot_normal, orthogonal
# First layer
# create input
main_input = Input(shape=(maxlen,), dtype='int32',name='main_input')
# creating the embedding
embedded_sequences = Embedding(input_dim = nb_words,
                               output_dim = WV_DIM,
                               mask_zero=False,
                               weights=[glove_embeddings],
                               input_length=maxlen,
                               trainable=False)(main_input)
#Second layer
embedded_sequences = SpatialDropout1D(0.2, seed=seed)(embedded_sequences)

#Third layer 
x = Bidirectional(CuDNNLSTM(64, return_sequences=True,
                            kernel_initializer=glorot_normal(seed=seed),
                            recurrent_initializer=orthogonal(seed=seed)))(embedded_sequences)

#Fourth layer 
# A GRU has no cell state: with return_state=True the bidirectional GRU returns the output
# sequence plus one final hidden state per direction (forward, backward)
x, forward_h, backward_h = Bidirectional(CuDNNGRU(64, return_sequences=True,return_state = True,
                            kernel_initializer=glorot_normal(seed=seed),
                            recurrent_initializer=orthogonal(seed=seed)))(x)

#concatenate
state_h = concatenate([forward_h, backward_h])

avg_pool = GlobalAveragePooling1D()(x)
max_pool = GlobalMaxPooling1D()(x)
conc = concatenate([state_h, avg_pool, max_pool])
# conc is 2D, (batch, features), which is what the Dense layers below expect
preds = Dense(16, activation="relu", kernel_initializer=glorot_normal(seed=seed))(conc)
preds = Dropout(0.1,seed=seed)(preds)
preds = Dense(1, activation="sigmoid", kernel_initializer=glorot_normal(seed=seed))(preds)


In [None]:
from keras.models import Model
from keras.optimizers import Adam
model = Model(inputs=main_input, outputs=preds)
model.compile(loss='binary_crossentropy', optimizer=Adam(clipvalue=1), metrics=['accuracy'])

In [None]:
import time
start = time.time()

hist = model.fit(X_train_seq, y_train, batch_size=1024, epochs=5, 
                 validation_data=(X_test_seq, y_test))

stop = time.time()
time_elapsed = stop-start

In [None]:
pred_val = model.predict(X_valid_seq, batch_size=1024, verbose=1)

In [None]:
scores.append(('BiLSTM+BiGRU, M2M, Max Pooling',get_scores(y_valid.target.tolist(),list(pred_val[:,0])),
               list(pred_val[:,0]),
               time.strftime("%M:%S", time.gmtime(time_elapsed)),
               model.count_params()
              ))

In [None]:
scores[-1][1]

In [None]:
plot_progress(scores)

In [None]:
import gc 
import time 

del(model, hist, pred_val)
gc.collect()
time.sleep(2)

#get_scores(y_valid,np.average(np.vstack([i[2] for i in scores[2:]]),axis=0).reshape(-1))

## 5.5 Bidirectional LSTM + Bidirectional GRU, Many to Many + Pooling + Attention

An example of using an attention mechanism on top of an RNN.

In [None]:
from keras import backend as K
from keras.engine.topology import Layer
#from keras import initializations
from keras import initializers, regularizers, constraints

#https://www.kaggle.com/qqgeogor/keras-lstm-attention-glove840b-lb-0-043
class Attention(Layer):
    def __init__(self, step_dim,
                 W_regularizer=None, b_regularizer=None,
                 W_constraint=None, b_constraint=None,
                 bias=True, **kwargs):
        """
        Keras Layer that implements an Attention mechanism for temporal data.
        Supports Masking.
        Follows the work of Raffel et al. [https://arxiv.org/abs/1512.08756]
        # Input shape
            3D tensor with shape: `(samples, steps, features)`.
        # Output shape
            2D tensor with shape: `(samples, features)`.
        :param kwargs:
        Just put it on top of an RNN Layer (GRU/LSTM/SimpleRNN) with return_sequences=True.
        The dimensions are inferred based on the output shape of the RNN.
        Example:
            model.add(LSTM(64, return_sequences=True))
            model.add(Attention(step_dim))  # step_dim = number of time steps
        """
        self.supports_masking = True
        #self.init = initializations.get('glorot_uniform')
        self.init = initializers.get('glorot_uniform')

        self.W_regularizer = regularizers.get(W_regularizer)
        self.b_regularizer = regularizers.get(b_regularizer)

        self.W_constraint = constraints.get(W_constraint)
        self.b_constraint = constraints.get(b_constraint)

        self.bias = bias
        self.step_dim = step_dim
        self.features_dim = 0
        super(Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        assert len(input_shape) == 3

        self.W = self.add_weight((input_shape[-1],),
                                 initializer=self.init,
                                 name='{}_W'.format(self.name),
                                 regularizer=self.W_regularizer,
                                 constraint=self.W_constraint)
        self.features_dim = input_shape[-1]

        if self.bias:
            self.b = self.add_weight((input_shape[1],),
                                     initializer='zero',
                                     name='{}_b'.format(self.name),
                                     regularizer=self.b_regularizer,
                                     constraint=self.b_constraint)
        else:
            self.b = None

        self.built = True

    def compute_mask(self, input, input_mask=None):
        # do not pass the mask to the next layers
        return None

    def call(self, x, mask=None):
        # eij = K.dot(x, self.W) TF backend doesn't support it

        # features_dim = self.W.shape[0]
        # step_dim = x._keras_shape[1]

        features_dim = self.features_dim
        step_dim = self.step_dim

        eij = K.reshape(K.dot(K.reshape(x, (-1, features_dim)), K.reshape(self.W, (features_dim, 1))), (-1, step_dim))

        if self.bias:
            eij += self.b

        eij = K.tanh(eij)

        a = K.exp(eij)

        # apply mask after the exp. will be re-normalized next
        if mask is not None:
            # Cast the mask to floatX to avoid float64 upcasting in theano
            a *= K.cast(mask, K.floatx())

        # in some cases especially in the early stages of training the sum may be almost zero
        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())

        a = K.expand_dims(a)
        weighted_input = x * a
        # weighted_input has shape (samples, steps, features)
        return K.sum(weighted_input, axis=1)

    def compute_output_shape(self, input_shape):
        #return input_shape[0], input_shape[-1]
        return input_shape[0],  self.features_dim
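
The core of the `call` method above can be restated in NumPy for a single sentence: score every time step, turn the scores into weights that sum to 1, and return the weighted sum of the hidden states. This is a simplified sketch that leaves out masking and the small epsilon added to the denominator; the function name `attention_pool` is made up.

```python
import numpy as np

def attention_pool(x, W, b):
    """x: (steps, features), W: (features,), b: (steps,) -> pooled (features,), weights (steps,)."""
    e = np.tanh(x @ W + b)                  # one score per time step
    a = np.exp(e)
    a = a / a.sum()                         # weights are positive and sum to 1
    pooled = (x * a[:, None]).sum(axis=0)   # weighted sum of the time steps
    return pooled, a

rng = np.random.default_rng(3)
steps, features = 5, 4                      # toy sizes; the notebook uses 70 steps and 128 features
x = rng.normal(size=(steps, features))

pooled, weights = attention_pool(x, rng.normal(size=features), np.zeros(steps))
print(pooled.shape)  # (4,)
```

With all-zero `W` and `b` every time step gets the same weight, so the layer reduces to mean pooling; training moves the weights toward the steps that matter for the prediction.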

In [None]:
from keras import Input
from keras.layers import Embedding, SpatialDropout1D, CuDNNLSTM, CuDNNGRU, Dropout, Dense, SimpleRNN
from keras.layers import concatenate, Bidirectional, GlobalAveragePooling1D, GlobalMaxPooling1D
from keras.initializers import glorot_normal, orthogonal
# First layer
# create input
main_input = Input(shape=(maxlen,), dtype='int32',name='main_input')
# creating the embedding
embedded_sequences = Embedding(input_dim = nb_words,
                               output_dim = WV_DIM,
                               mask_zero=False,
                               weights=[glove_embeddings],
                               input_length=maxlen,
                               trainable=False)(main_input)
#Second layer
embedded_sequences = SpatialDropout1D(0.2, seed=seed)(embedded_sequences)

#Third layer 
x = Bidirectional(CuDNNLSTM(64, return_sequences=True,
                            kernel_initializer=glorot_normal(seed=seed),
                            recurrent_initializer=orthogonal(seed=seed)))(embedded_sequences)

#Fourth layer 
# A GRU has no cell state: with return_state=True the bidirectional GRU returns the output
# sequence plus one final hidden state per direction (forward, backward)
x, forward_h, backward_h = Bidirectional(CuDNNGRU(64, return_sequences=True,return_state = True,
                            kernel_initializer=glorot_normal(seed=seed),
                            recurrent_initializer=orthogonal(seed=seed)))(x)

#concatenate
state_h = concatenate([forward_h, backward_h])

att = Attention(maxlen)(x)
avg_pool = GlobalAveragePooling1D()(x)
max_pool = GlobalMaxPooling1D()(x)
conc = concatenate([state_h, avg_pool, max_pool, att])

# conc is 2D, (batch, features), which is what the Dense layers below expect
preds = Dense(16, activation="relu", kernel_initializer=glorot_normal(seed=seed))(conc)
preds = Dropout(0.1,seed=seed)(preds)
preds = Dense(1, activation="sigmoid", kernel_initializer=glorot_normal(seed=seed))(preds)



In [None]:
from keras.models import Model
from keras.optimizers import Adam
model = Model(inputs=main_input, outputs=preds)
model.compile(loss='binary_crossentropy', optimizer=Adam(clipvalue=1), metrics=['accuracy'])

In [None]:
import time
start = time.time()

hist = model.fit(X_train_seq, y_train, batch_size=1024, epochs=5, 
                 validation_data=(X_test_seq, y_test))

stop = time.time()
time_elapsed = stop-start

In [None]:
pred_val = model.predict(X_valid_seq, batch_size=1024, verbose=1)

In [None]:
scores.append(('BiLSTM+BiGRU, M2M, Pool + Att',get_scores(y_valid.target.tolist(),list(pred_val[:,0])),
               list(pred_val[:,0]),
               time.strftime("%M:%S", time.gmtime(time_elapsed)),
               model.count_params()
              ))


In [None]:
scores[-1][1]

In [None]:
plot_progress(scores)

## 6. Blend results

To improve the results even further, simply average the predictions of the best models.

<img src="http://worldartsme.com/images/simple-blender-clipart-1.jpg" width="70" >


In [None]:
import time
start = time.time()
pred_val = np.average(np.vstack([i[2] for i in scores[-4:]]),axis=0).reshape(-1)

stop = time.time()
time_elapsed = stop-start

In [None]:
scores.append(('Blend last 4 models',get_scores(y_valid.target.tolist(),list(pred_val)),list(pred_val),
              time.strftime("%M:%S", time.gmtime(time_elapsed)),
              'n/a'
              ))


In [None]:
scores[-1][1]

In [None]:
plot_progress(scores)

# Conclusions 

* NLP is not easy but it's a lot of fun. 
* There are many possibilities left to try in order to build the best model for this problem, but I think that the ones mentioned here are a good starting point for doing something on your own.     
* To find the best model you have to try and try and try.    
* You need a GPU to train your models; fortunately Kaggle provides one now.  
* Keras is great, easy to learn and to use, but it has a significant flaw when you use it on a GPU: the results are not fully deterministic.   
* Many practitioners recommend moving to PyTorch once you are more experienced with LSTMs. Check my other kernel if you want to see a model made in PyTorch:  https://www.kaggle.com/nicke1/fine-text-preproc-concat-embedding-lstm-gru-att

# Questions that I've asked

1) [Why are the weights of RNN/LSTM networks shared across time?](https://stats.stackexchange.com/questions/221513/why-are-the-weights-of-rnn-lstm-networks-shared-across-time)     
"The 'shared weights' perspective comes from thinking about RNNs as feedforward networks unrolled across time. If the weights were different at each moment in time, this would just be a feedforward network. But, I suppose another way to think about it would be as an RNN whose weights are a time-varying function (and that could let you keep the ability to process variable length sequences).

If you did this, the number of parameters would grow linearly with the number of time steps."
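
A quick back-of-the-envelope calculation illustrates the last sentence of the quote. It counts the parameters of one tanh RNN layer with the sizes used in this notebook (300-dimensional inputs, 64 units, 70 time steps); the unshared variant is purely hypothetical.

```python
def simple_rnn_params(input_dim, units):
    """Input weights + recurrent weights + bias of one tanh RNN layer."""
    return input_dim * units + units * units + units

shared = simple_rnn_params(input_dim=300, units=64)
unshared = 70 * shared   # hypothetical network with separate weights for each of 70 steps

print(shared)    # 23360
print(unshared)  # 1635200
```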

2) [Difference between logistic regression and neural networks](https://stats.stackexchange.com/questions/43538/difference-between-logistic-regression-and-neural-networks)   
"Logistic regression: The simplest form of Neural Network, that results in decision boundaries that are a straight line"

3) [Different Between LSTM and LSTMCell Function in PyTorch](https://discuss.pytorch.org/t/different-between-lstm-and-lstmcell-function/5657/3)    
"LSTMCell is more flexible and you need less code with LSTM .

So with LSTMCell,
```
def forward(self, x):
        h = self.get_hidden() 
        for input in x:  
            h = self.rnn(input, h) # self.rnn = self.LSTMCell(input_size, hidden_size)
```
while with LSTM it is
```
def forward(self, x):
        h_0 = self.get_hidden()
        output, h = self.rnn(x, h_0) # self.rnn = self.LSTM(input_size, hidden_size)
```
"
4) [Why is the F-Measure a harmonic mean and not an arithmetic mean of the Precision and Recall measures?](https://stackoverflow.com/questions/26355942/why-is-the-f-measure-a-harmonic-mean-and-not-an-arithmetic-mean-of-the-precision)    
Consider a trivial method (e.g. always returning class A). There are infinite data elements of class B, and a single element of class A:

```
Precision: 0.0
Recall:    1.0
```
When taking the arithmetic mean, it would have 50% correct. Despite being the worst possible outcome! With the harmonic mean, the F1-measure is 0.
```
Arithmetic mean: 0.5
Harmonic mean:   0.0
```
In other words, to have a high F1, you need to both have a high precision and recall.
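
The numbers from the example are easy to reproduce. The helper names below are made up for this sketch, and the harmonic mean is defined as 0 when either input is 0, which is the usual convention for F1.

```python
def arithmetic_mean(p, r):
    return (p + r) / 2

def harmonic_mean(p, r):
    # Defined as 0 when either value is 0, which is also the limit of the formula.
    return 0.0 if p == 0 or r == 0 else 2 * p * r / (p + r)

precision, recall = 0.0, 1.0
print(arithmetic_mean(precision, recall))  # 0.5
print(harmonic_mean(precision, recall))    # 0.0
```

When precision and recall are equal, both means give the same value; the harmonic mean only punishes an imbalance between the two.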

5) [Difference between feedback RNN and LSTM/ GRU](https://stats.stackexchange.com/questions/222584/difference-between-feedback-rnn-and-lstm-gru):      
"All RNNs have feedback loops in the recurrent layer. This lets them maintain information in 'memory' over time. But, it can be difficult to train standard RNNs to solve problems that require learning long-term temporal dependencies. This is because the gradient of the loss function decays exponentially with time (called the vanishing gradient problem). LSTM networks are a type of RNN that uses special units in addition to standard units. LSTM units include a 'memory cell' that can maintain information in memory for long periods of time. A set of gates is used to control when information enters the memory, when it's output, and when it's forgotten. This architecture lets them learn longer-term dependencies. GRUs are similar to LSTMs, but use a simplified structure. They also use a set of gates to control the flow of information, but they don't use separate memory cells, and they use fewer gates."

6) [What is gradient clipping and why is it necessary?](https://www.quora.com/What-is-gradient-clipping-and-why-is-it-necessary).   
Gradient clipping limits the magnitude of the gradient and can make stochastic gradient descent (SGD) behave better in the vicinity of steep cliffs. 
The steep cliffs commonly occur in recurrent networks in the area where the recurrent network behaves approximately linearly. SGD without gradient clipping overshoots the landscape minimum, while SGD with gradient clipping descends into the minimum.     
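
Two common ways of limiting the gradient are sketched below in NumPy: clipping each component separately and rescaling the whole vector by its norm. The function names are chosen for this sketch; `Adam(clipvalue=1)`, as used in this notebook, corresponds to the first variant.

```python
import numpy as np

def clip_by_value(grad, clip_value):
    """Clip every gradient component to [-clip_value, clip_value]."""
    return np.clip(grad, -clip_value, clip_value)

def clip_by_norm(grad, max_norm):
    """Rescale the whole gradient if its L2 norm exceeds max_norm; the direction is kept."""
    norm = np.linalg.norm(grad)
    return grad if norm <= max_norm else grad * (max_norm / norm)

grad = np.array([0.5, -3.0, 4.0])

clipped_value = clip_by_value(grad, 1.0)   # components become 0.5, -1.0, 1.0
clipped_norm = clip_by_norm(grad, 1.0)     # same direction as grad, length 1

print(clipped_value)
print(np.linalg.norm(clipped_norm))
```

Clipping by value can change the direction of the update, while clipping by norm only shortens it.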

7) [how could i get both the final hidden state and sequence in a LSTM layer when using a bidirectional wrapper](https://stackoverflow.com/questions/49313650/how-could-i-get-both-the-final-hidden-state-and-sequence-in-a-lstm-layer-when-us)    

The call Bidirectional(LSTM(128, return_sequences=True, return_state=True))(input) returns 5 tensors:     
* The entire sequence of hidden states, by default it'll be the concatenation of forward and backward states.
* The last hidden state h for the forward LSTM
* The last cell state c for the forward LSTM
* The last hidden state h for the backward LSTM
* The last cell state c for the backward LSTM

```
lstm, forward_h, forward_c, backward_h, backward_c = Bidirectional(LSTM(128, return_sequences=True, return_state=True))(input)
state_h = Concatenate()([forward_h, backward_h])
state_c = Concatenate()([forward_c, backward_c])
``` 

