# Detecting duplicate questions

###### Info 
Date 16th of May 2017

Version 1.5: use of Neural networks

Version 1.3: use of Google News words vectors

Version 1.2: use of XGBoost - differences in length as feature

Version 1: initial framework

Gilles Daquin



# Synopsis

### Preparation


### Feature creation (base)

### Feature extension (did not give good results)

### Feature validation

### Training of model & test

### Prepare submission
a) save submission data


# Preparation

In [1]:
%matplotlib inline

#imports
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import re

import sklearn as sk
import nltk
from nltk import word_tokenize, ngrams
from nltk.corpus import stopwords
import gensim

from nltk import bigrams
from nltk import trigrams

import xgboost




In [2]:
from tqdm import tqdm
from keras.models import Sequential
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM, GRU
from keras.layers.normalization import BatchNormalization
from keras.utils import np_utils
from keras.layers import Merge
from keras.layers import TimeDistributed, Lambda
from keras.layers import Convolution1D, GlobalMaxPooling1D
from keras.callbacks import ModelCheckpoint
from keras import backend as K
from keras.layers.advanced_activations import PReLU
from keras.preprocessing import sequence, text

Using TensorFlow backend.


# General Architecture

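This notebook reaches an accuracy of 0.85 with a deep neural network composed of:

Two translation layers (time-distributed dense layers), one for each question, initialized with GloVe embeddings

Two LSTMs, one for each question, whose embeddings are learned from scratch rather than initialized with GloVe

Two 1D convolutional branches (two convolutions each), one for each question, also initialized with GloVe embeddings

The outputs of these branches are concatenated and followed by a series of dense layers with dropout and batch normalization.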
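As a rough check of this architecture, the sketch below tallies the output width of each branch, read off the model definitions in this notebook (every branch ends in 300 units), to get the width of the concatenated vector that feeds the dense stack. The dictionary keys are illustrative labels only, not Keras names.

```python
# Output width of each branch, as defined in the model cells of this notebook.
# Keys are illustrative labels; q1/q2 follow the order of inputs passed to fit().
branch_widths = {
    "model1_translation_q1": 300,  # TimeDistributed(Dense(300)) summed over time
    "model2_translation_q2": 300,
    "model3_conv_q1": 300,         # Convolution1D x2 -> GlobalMaxPooling1D -> Dense(300)
    "model4_conv_q2": 300,
    "model5_lstm_q1": 300,         # LSTM(300)
    "model6_lstm_q2": 300,
}

# mode='concat' places the branch outputs side by side
merged_width = sum(branch_widths.values())
print(merged_width)  # 1800
```

Each question is therefore represented three times (translation, convolution, LSTM) before the dense layers compare the two.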

In [3]:
data = pd.read_csv("/home/gilles/Documents/NLP_code/Data/train.csv")
y = data.is_duplicate.values

tk = text.Tokenizer(nb_words=200000)




## input text as sequence numbers

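Vectorize the texts by turning each question into a sequence: a list of word indexes, where the word of rank i in the dataset (ranks start at 1, most frequent word first) has index i. Each sequence is then padded or truncated to max_len = 40 indexes.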
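The idea can be sketched in plain Python. This is only a stand-in for what `text.Tokenizer` and `sequence.pad_sequences` do: the helper names are hypothetical, and the sketch splits on whitespace only, whereas Keras also filters punctuation. Padding and truncation are applied at the front of the sequence, which matches the Keras defaults.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for t in texts for word in t.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace each known word by its index; unknown words are dropped."""
    return [[word_index[w] for w in t.lower().split() if w in word_index]
            for t in texts]

def pad_sequences(seqs, maxlen):
    """Left-pad with 0 and keep only the last maxlen indexes."""
    padded = []
    for seq in seqs:
        seq = seq[-maxlen:]
        padded.append([0] * (maxlen - len(seq)) + seq)
    return padded

questions = ["how do I learn python", "how do I learn java fast"]
word_index = fit_word_index(questions)
sequences = texts_to_sequences(questions, word_index)
padded = pad_sequences(sequences, maxlen=6)
print(padded)  # [[0, 1, 2, 3, 4, 5], [1, 2, 3, 4, 6, 7]]
```

Index 0 is reserved for padding, which is why the embedding matrix needs one more row than there are words.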

In [4]:
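# Turn the questions into padded matrices of word indexes

# Build the dictionary (word -> index) over both question columns.
# astype(str) guards against missing values in question2.
max_len = 40
tk.fit_on_texts(list(data.question1.values) + list(data.question2.values.astype(str)))

# Build the sequences (indexed on the dictionary) and pad them to max_len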
x1 = tk.texts_to_sequences(data.question1.values)
x1 = sequence.pad_sequences(x1, maxlen=max_len)

x2 = tk.texts_to_sequences(data.question2.values.astype(str))
x2 = sequence.pad_sequences(x2, maxlen=max_len)

word_index = tk.word_index

# One-hot encode the is_duplicate label
ytrain_enc = np_utils.to_categorical(y)


## Loading the GloVe vectors into a dictionary, then into a matrix

In [5]:
# Map each GloVe token to its 300-dimensional vector
embeddings_index = {}
with open('/home/gilles/Documents/NLP_code/Data/glove.840B.300d.txt') as f:
    for line in tqdm(f):
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

2196018it [01:51, 19767.41it/s]

Found 2196017 word vectors.





In [6]:
embedding_matrix = np.zeros((len(word_index) + 1, 300))

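# Build the embedding matrix, of shape (len(word_index) + 1, 300):
#   len(word_index) + 1 = number of distinct words in the question corpus,
#                         plus one row for the padding index 0
#   300 = dimension of the GloVe vectors
# Each corpus word found in GloVe gets its 300-dimensional vector;
# words missing from GloVe keep a row of zeros.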
for word, i in tqdm(word_index.items()):
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector


100%|██████████| 95603/95603 [00:00<00:00, 275181.62it/s]


### Embedding layers in Keras NN model

The **Embedding** layer transforms word indexes into their corresponding word embeddings. Conceptually it is a simple matrix multiplication; in practice Keras implements it as a row lookup.

The weights of the Embedding layer have the shape (vocabulary_size, embedding_dimension). For each training sample, the input is a sequence of integers, each representing a word. The integers are in the range of the vocabulary size. The Embedding layer maps each integer i to the i-th row of the embedding weights matrix.

To see this as a matrix multiplication, picture the input not as a list of integers but as a one-hot matrix of shape (nb_words, vocabulary_size) with one non-zero value per row. Multiplying it by the embedding weights gives an output of shape

(nb_words, vocab_size) x (vocab_size, embedding_dim) = (nb_words, embedding_dim)

So a single matrix multiplication, or equivalently a row lookup, transforms all the words in a sample into the corresponding word embeddings.
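The equivalence between the one-hot product and a plain row lookup can be checked on a toy example. This is a minimal pure-Python sketch; the helper names and the tiny 4 x 2 weight matrix are made up for illustration.

```python
def one_hot(indices, vocab_size):
    """One row per word, with a single 1.0 at the word's index."""
    return [[1.0 if col == idx else 0.0 for col in range(vocab_size)]
            for idx in indices]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

# Toy embedding weights: vocabulary of 4 indexes, embedding dimension 2.
weights = [[0.0, 0.0],   # index 0 (padding)
           [0.1, 0.2],
           [0.3, 0.4],
           [0.5, 0.6]]

sample = [2, 0, 3]  # one sample of 3 word indexes

via_matmul = matmul(one_hot(sample, vocab_size=4), weights)  # (3, 4) x (4, 2)
via_lookup = [weights[i] for i in sample]                    # row lookup

print(via_matmul == via_lookup)  # True
```

The lookup avoids building the (nb_words, vocab_size) one-hot matrix, which matters with a vocabulary of this size.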

**----------------------------------------------------------------------------------------------------------**

**TimeDistributed** is a Keras wrapper that takes any static (non-sequential) layer and applies it in a sequential manner.

If a layer accepts an input of shape (d1, ..., dn), then wrapped in TimeDistributed it accepts an input of shape (sequence_len, d1, ..., dn) and is applied to each slice X[0,:,...,:], X[1,:,...,:], ..., X[sequence_len - 1,:,...,:].

For example, a pretrained convolutional layer can be applied to a short video clip with TimeDistributed(conv_layer), where conv_layer is applied to each frame of the clip. This produces a sequence of outputs that can then be consumed by a following recurrent or TimeDistributed layer.

Note that TimeDistributedDense is deprecated; use TimeDistributed(Dense) instead.
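What the wrapper does can be sketched without Keras: the same dense layer, with the same weights, is applied independently to every timestep. The function names and the small weights below are hypothetical, chosen only to keep the numbers easy to follow.

```python
def dense_relu(x, weights, bias):
    """Dense layer with ReLU on one vector x; weights has shape (len(x), len(bias))."""
    out = []
    for j in range(len(bias)):
        z = bias[j] + sum(x[i] * weights[i][j] for i in range(len(x)))
        out.append(max(0.0, z))
    return out

def time_distributed(layer, sequence):
    """Apply the same layer (shared weights) to every timestep of the sequence."""
    return [layer(step) for step in sequence]

weights = [[1.0, -1.0],
           [0.5, 2.0],
           [0.0, 1.0]]          # 3 input features -> 2 output units
bias = [0.0, -1.0]

# One sample: 4 timesteps, 3 features each.
sequence = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [1.0, 1.0, 1.0],
            [0.0, 0.0, 0.0]]

outputs = time_distributed(lambda x: dense_relu(x, weights, bias), sequence)
print(outputs)  # [[1.0, 0.0], [0.5, 1.0], [1.5, 1.0], [0.0, 0.0]]
```

The output keeps the time axis: 4 timesteps in, 4 timesteps out, with only the feature dimension changed.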

**----------------------------------------------------------------------------------------------------------**

The **Lambda** layer is an easy way to wrap a simple arithmetic expression as a layer.

**Example:**

model.add(Lambda(lambda x: x ** 2))

**Arguments**

function: The function to be evaluated. Takes input tensor as first argument.

output_shape: Expected output shape from the function. Only relevant when using Theano. Can be a tuple or a function.

If a tuple, it only specifies the first dimension onward; the sample dimension is assumed to be either the same as the input, output_shape = (input_shape[0], ) + output_shape, or, when the input is None, also None, output_shape = (None, ) + output_shape.

If a function, it specifies the entire shape as a function of the input shape: output_shape = f(input_shape).

arguments: Optional dictionary of keyword arguments to be passed to the function.

**Input shape**

Arbitrary. Use the keyword argument input_shape (tuple of integers, does not include the samples axis) when using this layer as the first layer in a model.

**Output shape**

Specified by output_shape argument (or auto-inferred when using TensorFlow).
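The two forms of output_shape can be illustrated with a small stand-in class. `MiniLambda` and `sum_over_time` are hypothetical names, not part of Keras; the sketch only mirrors the shape rule described above, together with a sum over the time axis.

```python
class MiniLambda:
    """Hypothetical stand-in for a Lambda layer: a function plus an output shape."""

    def __init__(self, function, output_shape):
        self.function = function
        self.output_shape = output_shape

    def __call__(self, batch):
        return self.function(batch)

    def compute_output_shape(self, input_shape):
        if callable(self.output_shape):
            # A function gives the entire shape from the input shape.
            return self.output_shape(input_shape)
        # A tuple gives the shape without the sample dimension.
        return (input_shape[0],) + tuple(self.output_shape)

def sum_over_time(batch):
    """Sum each sample of shape (timesteps, features) over its timesteps."""
    return [[sum(column) for column in zip(*sample)] for sample in batch]

layer = MiniLambda(sum_over_time, output_shape=(300,))
shape_from_tuple = layer.compute_output_shape((None, 40, 300))

layer_fn = MiniLambda(sum_over_time, output_shape=lambda s: (s[0], s[2]))
shape_from_function = layer_fn.compute_output_shape((None, 40, 300))

batch = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]]  # 1 sample, 3 timesteps, 2 features
summed = layer(batch)
print(shape_from_tuple, shape_from_function, summed)
# (None, 300) (None, 300) [[9.0, 12.0]]
```

Summing over the time axis collapses a (40, 300) sequence into a single 300-dimensional vector per question.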

In [9]:
max_features = 200000
filter_length = 5
nb_filter = 64
pool_length = 4

model = Sequential()
print('Build model...')

#------------------------------------------------------------------------------------------------------------------
# MODEL 1
#------------------------------------------------------------------------------------------------------------------

model1 = Sequential()
model1.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=40,
                     trainable=False))

model1.add(TimeDistributed(Dense(300, activation='relu')))
model1.add(Lambda(lambda x: K.sum(x, axis=1), output_shape=(300,)))


Build model...


In [10]:
#------------------------------------------------------------------------------------------------------------------
# MODEL 2
#------------------------------------------------------------------------------------------------------------------

model2 = Sequential()
model2.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=40,
                     trainable=False))

model2.add(TimeDistributed(Dense(300, activation='relu')))
model2.add(Lambda(lambda x: K.sum(x, axis=1), output_shape=(300,)))




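In convolutional neural networks (CNNs), 1D and 2D filters are not really one- and two-dimensional; the names are a descriptive convention.

For example, with 50-dimensional word vectors each 1D filter is actually an L x 50 filter, where L is the filter length. The convolution slides along one dimension only (the sequence axis), which is why it is called 1D. With 'same' padding and a sequence of 400 words, each filter produces a 400 x 1 vector, and the Convolution1D layer outputs a 400 x nb_filter matrix. In this notebook the word vectors have 300 dimensions, the sequences have 40 words, filter_length is 5 and border_mode is 'valid' (no padding), so each filter of the first convolution is 5 x 300 and produces 40 - 5 + 1 = 36 values, and the layer outputs a 36 x 64 matrix.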
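The output length of a 1D convolution follows from simple arithmetic, sketched below for the settings used in this notebook (sequence length 40, filter_length 5, 'valid' border mode, stride 1). The helper name is an assumption made for this example.

```python
def conv_output_length(input_length, filter_length, border_mode="valid", stride=1):
    """Number of positions a 1D filter visits along the sequence axis."""
    if border_mode == "valid":      # no padding: the filter must fit entirely
        length = input_length - filter_length + 1
    elif border_mode == "same":     # padded so the length is preserved
        length = input_length
    else:
        raise ValueError("unsupported border_mode: %r" % border_mode)
    return (length + stride - 1) // stride

sequence_length = 40
embedding_dim = 300
filter_length = 5
nb_filter = 64

after_conv1 = conv_output_length(sequence_length, filter_length)  # 36
after_conv2 = conv_output_length(after_conv1, filter_length)      # 32

# Each filter of the first convolution spans filter_length words x embedding_dim values.
weights_per_filter = filter_length * embedding_dim                # 1500

print(after_conv1, after_conv2, weights_per_filter)  # 36 32 1500
```

After the second convolution the feature map is 32 x 64, and global max pooling keeps one value per filter, giving a vector of length 64.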

**Batch Normalization** is just another layer, so you can use it as such to create your desired network architecture.

The general use case is to use BN between the linear and non-linear layers in your network, because it normalizes the input to your activation function, so that you're centered in the linear section of the activation function (such as Sigmoid).
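The normalization itself is short enough to sketch for a single feature across a batch. This is a simplified illustration using batch statistics only; gamma, beta and the eps value are assumptions here, and a real layer learns gamma and beta and keeps moving averages for use at inference time.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-3):
    """Normalize one feature over a batch to zero mean and roughly unit variance,
    then rescale by gamma and shift by beta."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(variance + eps) + beta for v in values]

# Pre-activation values of one unit for a batch of 4 samples.
pre_activations = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(pre_activations)

print([round(v, 3) for v in normalized])
```

The outputs are centred on zero, which is where a sigmoid is closest to linear and its gradient is largest.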

In [11]:
#------------------------------------------------------------------------------------------------------------------
# MODEL 3
#------------------------------------------------------------------------------------------------------------------


model3 = Sequential()
model3.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=40,
                     trainable=False))
model3.add(Convolution1D(nb_filter=nb_filter,
                         filter_length=filter_length,
                         border_mode='valid',
                         activation='relu',
                         subsample_length=1))
model3.add(Dropout(0.2))

model3.add(Convolution1D(nb_filter=nb_filter,
                         filter_length=filter_length,
                         border_mode='valid',
                         activation='relu',
                         subsample_length=1))

model3.add(GlobalMaxPooling1D())
model3.add(Dropout(0.2))

model3.add(Dense(300))
model3.add(Dropout(0.2))
model3.add(BatchNormalization())





In [12]:
#------------------------------------------------------------------------------------------------------------------
# MODEL 4
#------------------------------------------------------------------------------------------------------------------

model4 = Sequential()
model4.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=40,
                     trainable=False))
model4.add(Convolution1D(nb_filter=nb_filter,
                         filter_length=filter_length,
                         border_mode='valid',
                         activation='relu',
                         subsample_length=1))
model4.add(Dropout(0.2))

model4.add(Convolution1D(nb_filter=nb_filter,
                         filter_length=filter_length,
                         border_mode='valid',
                         activation='relu',
                         subsample_length=1))

model4.add(GlobalMaxPooling1D())
model4.add(Dropout(0.2))

model4.add(Dense(300))
model4.add(Dropout(0.2))
model4.add(BatchNormalization())




<img src="architectureNN.png">

In [13]:
#------------------------------------------------------------------------------------------------------------------
# MODEL 5
#------------------------------------------------------------------------------------------------------------------
model5 = Sequential()
model5.add(Embedding(len(word_index) + 1, 300, input_length=40, dropout=0.2))
model5.add(LSTM(300, dropout_W=0.2, dropout_U=0.2))




In [14]:
#------------------------------------------------------------------------------------------------------------------
# MODEL 6
#------------------------------------------------------------------------------------------------------------------
model6 = Sequential()
model6.add(Embedding(len(word_index) + 1, 300, input_length=40, dropout=0.2))
model6.add(LSTM(300, dropout_W=0.2, dropout_U=0.2))





In [15]:
#------------------------------------------------------------------------------------------------------------------
# MODEL Merge
#------------------------------------------------------------------------------------------------------------------
merged_model = Sequential()
merged_model.add(Merge([model1, model2, model3, model4, model5, model6], mode='concat'))
merged_model.add(BatchNormalization())




In [16]:
#------------------------------------------------------------------------------------------------------------------
# Add layer
#------------------------------------------------------------------------------------------------------------------
merged_model.add(Dense(300))
merged_model.add(PReLU())
merged_model.add(Dropout(0.2))
merged_model.add(BatchNormalization())


In [17]:
#------------------------------------------------------------------------------------------------------------------
# Add layer
#------------------------------------------------------------------------------------------------------------------
merged_model.add(Dense(300))
merged_model.add(PReLU())
merged_model.add(Dropout(0.2))
merged_model.add(BatchNormalization())


In [18]:
#------------------------------------------------------------------------------------------------------------------
# Add layer
#------------------------------------------------------------------------------------------------------------------
merged_model.add(Dense(300))
merged_model.add(PReLU())
merged_model.add(Dropout(0.2))
merged_model.add(BatchNormalization())


In [19]:
#------------------------------------------------------------------------------------------------------------------
# Add layer
#------------------------------------------------------------------------------------------------------------------
merged_model.add(Dense(300))
merged_model.add(PReLU())
merged_model.add(Dropout(0.2))
merged_model.add(BatchNormalization())


In [20]:
#------------------------------------------------------------------------------------------------------------------
# Add layer
#------------------------------------------------------------------------------------------------------------------
merged_model.add(Dense(300))
merged_model.add(PReLU())
merged_model.add(Dropout(0.2))
merged_model.add(BatchNormalization())


In [21]:
#------------------------------------------------------------------------------------------------------------------
# Output layer
#------------------------------------------------------------------------------------------------------------------
merged_model.add(Dense(1))
merged_model.add(Activation('sigmoid'))

merged_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

checkpoint = ModelCheckpoint('weights.h5', monitor='val_acc', save_best_only=True, verbose=2)

merged_model.fit([x1, x2, x1, x2, x1, x2], y=y, batch_size=384, nb_epoch=20,
                 verbose=1, validation_split=0.1, shuffle=True, callbacks=[checkpoint])



Train on 363861 samples, validate on 40429 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f6504266690>