# Quora Question Pairs Modeling Notebook

This notebook tries to predict whether a pair of Quora questions are duplicates or not.

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from zipfile import ZipFile
from time import time
from numpy import empty

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

test.csv
train.csv



First, let's load the training dataset.

In [2]:
df_train = pd.read_csv('../input/train.csv')
df_train.head()

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0


In [3]:
texts = df_train[['question1','question2']]
labels = df_train['is_duplicate']

del df_train

Now, let's build our model. First we need to tokenize the questions to create a word index. Then we use an embedding layer that transforms the input vectors (our sentences expressed as word indices) into dense vectors that represent them in the embedding space.
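To make the tokenize, pad, and embed steps concrete, here is a minimal pure-Python sketch of the idea. It is not the Keras implementation: the helper names `build_word_index` and `to_padded_sequence` are hypothetical, the sketch only splits on whitespace (Keras also filters punctuation), and the embedding table is random rather than learned.

```python
from collections import Counter
import random


def build_word_index(sentences):
    """Map each word to an integer id, most frequent word first (ids start at 1)."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i + 1 for i, word in enumerate(ordered)}


def to_padded_sequence(sentence, word_index, maxlen):
    """Encode a sentence as word ids, left-padded with 0, keeping the last maxlen ids."""
    ids = [word_index[w] for w in sentence.lower().split() if w in word_index]
    ids = ids[-maxlen:]
    return [0] * (maxlen - len(ids)) + ids


sentences = ["how do i learn python", "how do i learn to cook"]
word_index = build_word_index(sentences)
seq = to_padded_sequence("how do i cook", word_index, maxlen=6)
print(seq)  # [0, 0, 2, 1, 3, 5]

# Toy embedding table (assumption: random values instead of trained weights).
# Row 0 belongs to the padding id, so the table has len(word_index) + 1 rows.
EMBEDDING_DIM = 4
rng = random.Random(0)
embedding = [[rng.uniform(-1, 1) for _ in range(EMBEDDING_DIM)]
             for _ in range(len(word_index) + 1)]
dense_vectors = [embedding[i] for i in seq]  # one EMBEDDING_DIM vector per position
```

Each question thus becomes a fixed-length list of ids, and the embedding lookup turns that list into a `(sequence length, EMBEDDING_DIM)` array that the rest of the network consumes.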

In [4]:
# Model params
MAX_NB_WORDS = 100000
MAX_SEQUENCE_LENGTH = 25
VALIDATION_SPLIT = 0.1
EMBEDDING_DIM = 32

# Train params
NB_EPOCHS = 1
BATCH_SIZE = 1024
VAL_SPLIT = 0.1
WEIGHTS_PATH = 'lstm_weights.h5'
SUBMIT_PATH = 'lstm_submission_1.csv'

Prepare the questions.

In [5]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

tk = Tokenizer(nb_words=MAX_NB_WORDS)

tk.fit_on_texts(list(texts.question1.values.astype(str)) + list(texts.question2.values.astype(str)))
x1 = tk.texts_to_sequences(texts.question1.values.astype(str))
x1 = pad_sequences(x1, maxlen=MAX_SEQUENCE_LENGTH)

x2 = tk.texts_to_sequences(texts.question2.values.astype(str))
x2 = pad_sequences(x2, maxlen=MAX_SEQUENCE_LENGTH)

# Preprocessing Test
print("Acquiring Test Data")
t0 = time()
df_test = pd.read_csv('../input/test.csv')
print("Done! Acquisition time:", time()-t0)

# Preprocessing
print("Preprocessing test data")
t0 = time()

i = 0
while True:
    if (i*BATCH_SIZE > df_test.shape[0]):
        break
    t1 = time()
    tk.fit_on_texts(list(df_test.iloc[i*BATCH_SIZE:(i+1)*BATCH_SIZE].question1.values.astype(str))
                    + list(df_test.iloc[i*BATCH_SIZE:(i+1)*BATCH_SIZE].question2.values.astype(str)))
    i += 1
    if (i % 100 == 0):
        print("Preprocessed Batch {0}/{1}, Word index size: {2}, ETC: {3} seconds".format(i,
                                                                int(df_test.shape[0]/BATCH_SIZE+1),
                                                                len(tk.word_index),
                                                                int(int(df_test.shape[0]/BATCH_SIZE+1)-i)*(time()-t1)))

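word_index = tk.word_index

# fit_on_texts re-ranks word_index by frequency on every call, so re-encode the
# training questions with the final index; otherwise train and test word ids disagree.
x1 = pad_sequences(tk.texts_to_sequences(texts.question1.values.astype(str)), maxlen=MAX_SEQUENCE_LENGTH)
x2 = pad_sequences(tk.texts_to_sequences(texts.question2.values.astype(str)), maxlen=MAX_SEQUENCE_LENGTH)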

print("Done! Preprocessing time:", time()-t0)
print("Word index length:",len(word_index))

print('Shape of data tensor:', x1.shape, x2.shape)
print('Shape of label tensor:', labels.shape)

Using TensorFlow backend.


Acquiring Test Data
Done! Acquisition time: 5.820836067199707
Preprocessing test data
Preprocessed Batch 100/2291, Word index size: 107704, ETC: 420.59292912483215 seconds
Preprocessed Batch 200/2291, Word index size: 115958, ETC: 490.5876259803772 seconds
Preprocessed Batch 300/2291, Word index size: 121623, ETC: 498.6466920375824 seconds
Preprocessed Batch 400/2291, Word index size: 125845, ETC: 505.8784236907959 seconds
Preprocessed Batch 500/2291, Word index size: 129007, ETC: 527.0521574020386 seconds
Preprocessed Batch 600/2291, Word index size: 131381, ETC: 541.9919347763062 seconds
Preprocessed Batch 700/2291, Word index size: 133093, ETC: 441.7637176513672 seconds
Preprocessed Batch 800/2291, Word index size: 134319, ETC: 507.5879158973694 seconds
Preprocessed Batch 900/2291, Word index size: 135225, ETC: 390.2722487449646 seconds
Preprocessed Batch 1000/2291, Word index size: 135885, ETC: 352.63199067115784 seconds
Preprocessed Batch 1100/2291, Word index size: 136304, ETC: 3

Our model, this time, is a siamese-style deep neural network. Each "head" has an embedding layer, a time-distributed dense layer applied along the sequence axis, and a sum layer that aggregates the outputs of the time-distributed dense layer over the sequence. A concatenation layer then merges the two heads, and finally a dense network with a sigmoid activation at the end produces the prediction. Note that the two heads are built separately, so they do not share weights.
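The data flow through one head (embed, apply the same dense layer with ReLU at every time step, sum over the sequence) and the concatenation of the two heads can be sketched in plain Python. This is only an illustration under simplifying assumptions: the parameter values are made up, both heads reuse the same toy parameters for brevity (in the notebook each encoder has its own), and the helper names `relu`, `dense` and `encode` are hypothetical.

```python
def relu(v):
    """Element-wise ReLU."""
    return [max(0.0, x) for x in v]


def dense(v, weights, bias):
    """Dense layer: out[j] = sum_i v[i] * weights[i][j] + bias[j]."""
    return [sum(v[i] * weights[i][j] for i in range(len(v))) + bias[j]
            for j in range(len(bias))]


def encode(token_ids, embedding, weights, bias):
    """One head: embed each token, apply dense+ReLU per time step, sum over time."""
    steps = [relu(dense(embedding[t], weights, bias)) for t in token_ids]
    return [sum(step[j] for step in steps) for j in range(len(bias))]


# Toy parameters (made-up values): 3 word ids, embedding size 2.
embedding = [[0.0, 0.0], [1.0, -1.0], [0.5, 2.0]]
weights = [[1.0, 0.0], [0.0, 1.0]]  # identity matrix, to keep the arithmetic easy
bias = [0.0, 0.0]

head_1 = encode([0, 1, 2], embedding, weights, bias)  # question 1 as word ids
head_2 = encode([0, 2, 2], embedding, weights, bias)  # question 2 as word ids
merged = head_1 + head_2  # list concatenation plays the role of the concat merge
print(merged)  # [1.5, 2.0, 1.0, 4.0]
```

Summing over the time axis makes each head's output independent of word order and gives a fixed-size vector regardless of question length, which is what lets the dense layers on top work on the concatenated pair.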

In [6]:
from keras.layers import Dense, Dropout, Lambda, TimeDistributed, PReLU, Merge, Activation, Embedding
from keras.models import Sequential, load_model
from keras.layers.normalization import BatchNormalization
from keras.callbacks import ModelCheckpoint
from keras import backend as K

def get_model(p_drop=0.0):
    encoder_1 = Sequential()
    encoder_1.add(Embedding(len(word_index) + 1,
                                EMBEDDING_DIM,
                                input_length=MAX_SEQUENCE_LENGTH))

    encoder_1.add(TimeDistributed(Dense(EMBEDDING_DIM, activation='relu')))
    encoder_1.add(Lambda(lambda x: K.sum(x, axis=1), output_shape=(EMBEDDING_DIM,)))

    encoder_2 = Sequential()
    encoder_2.add(Embedding(len(word_index) + 1,
                                EMBEDDING_DIM,
                                input_length=MAX_SEQUENCE_LENGTH))

    encoder_2.add(TimeDistributed(Dense(EMBEDDING_DIM, activation='relu')))
    encoder_2.add(Lambda(lambda x: K.sum(x, axis=1), output_shape=(EMBEDDING_DIM,)))

    model = Sequential()
    model.add(Merge([encoder_1, encoder_2], mode='concat'))
    model.add(BatchNormalization())

    model.add(Dense(EMBEDDING_DIM))
    model.add(PReLU())
    model.add(Dropout(p_drop))
    model.add(BatchNormalization())

    model.add(Dense(EMBEDDING_DIM))
    model.add(PReLU())
    model.add(Dropout(p_drop))
    model.add(BatchNormalization())

    model.add(Dense(1))
    model.add(Activation('sigmoid'))

    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    return model

It is usually a good idea to search over a parameter space for the optimal hyperparameters. Here we attempt this with a grid search using sklearn.

In [7]:
'''
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV

print("Searching for optimum hyper parameters.")
t0 = time()
model = KerasClassifier(build_fn=get_model, verbose=0)

# define the grid search parameters
batch_size = [128, 256, 512, 1024, 2048]
p_drop = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
param_grid = dict(batch_size=batch_size, p_drop=p_drop)

grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid_result = grid.fit([x1, x2], labels)

# summarize results
print("Best: {0} using {1}".format(grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
print("Done! Time elapsed:", time()-t0)
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
'''
    
# This is the usual code for grid-searching a Keras model with sklearn; however, for the merged model
# I got an error. As I couldn't find a solution for it on the web and don't have the time to dig
# deeper, I'd appreciate hearing your insights on how to do it, if you have some to share!

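One possible workaround, not tried in this notebook, is to skip the sklearn wrapper and loop over the parameter grid by hand. The sketch below shows only the search loop: `grid_search` is a hypothetical helper, and `fake_validation_accuracy` is a stand-in scoring function with made-up behavior, not a real training result.

```python
from itertools import product


def grid_search(evaluate, param_grid):
    """Try every combination in param_grid and return (best_score, best_params)."""
    names = sorted(param_grid)
    best_score, best_params = float("-inf"), None
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params


def fake_validation_accuracy(batch_size, p_drop):
    """Stand-in for training a model and measuring validation accuracy (NOT real data)."""
    return 1.0 - abs(p_drop - 0.2) - abs(batch_size - 512) / 10000.0


param_grid = {"batch_size": [128, 256, 512, 1024, 2048],
              "p_drop": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]}
best_score, best_params = grid_search(fake_validation_accuracy, param_grid)
print("Best: {0} using {1}".format(best_score, best_params))
```

In this notebook, the scoring function would presumably build the model with `get_model(p_drop=...)`, fit it on `[x1, x2]` with the given batch size and a validation split, and return the validation accuracy. Unlike `GridSearchCV`, this loop uses a single validation split rather than cross-validation and runs the combinations one after another.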

In [8]:
model = get_model(p_drop=0.2)
# Save the best weights to WEIGHTS_PATH so the commented-out load_model(WEIGHTS_PATH) below finds them
checkpoint = ModelCheckpoint(WEIGHTS_PATH, monitor='val_acc', save_best_only=True, verbose=2)

model.fit([x1, x2], y=labels, batch_size=384, nb_epoch=NB_EPOCHS,
                 verbose=1, validation_split=VAL_SPLIT, shuffle=True, callbacks=[checkpoint])



Train on 363861 samples, validate on 40429 samples
Epoch 1/1


<keras.callbacks.History at 0x7f94a88657f0>

As Kaggle has a time limit for running kernels, this model trains for just one epoch and is pretty small. A bigger/deeper model with proper training time will likely perform better.

In [9]:
# Load best model
#print("Loading best trained model")
#model = load_model(WEIGHTS_PATH)

# Predicting
i = 0
predictions = empty([df_test.shape[0],1])
while True:
    t1 = time()
    if (i * BATCH_SIZE > df_test.shape[0]):
        break
    x1 = pad_sequences(tk.texts_to_sequences(
        df_test.question1.iloc[i * BATCH_SIZE:(i + 1) * BATCH_SIZE].values.astype(str)), maxlen=MAX_SEQUENCE_LENGTH)
    x2 = pad_sequences(tk.texts_to_sequences(
        df_test.question2.iloc[i * BATCH_SIZE:(i + 1) * BATCH_SIZE].values.astype(str)), maxlen=MAX_SEQUENCE_LENGTH)
    try:
        predictions[i*BATCH_SIZE:(i+1)*BATCH_SIZE] = model.predict([x1, x2], batch_size=BATCH_SIZE, verbose=0)
    except ValueError:
        predictions[i*BATCH_SIZE:] = model.predict([x1, x2], batch_size=BATCH_SIZE, verbose=0)[:(df_test.shape[0]-i*BATCH_SIZE)]

    i += 1
    if (i % 1000 == 0):
        print("Predicted Batch {0}/{1}, ETC: {2} seconds".format(i,
                                                                int(df_test.shape[0]/BATCH_SIZE),
                                                                int(int(df_test.shape[0]/BATCH_SIZE+1)-i)*(time()-t1)))


df_test["is_duplicate"] = predictions


df_test[['test_id','is_duplicate']].to_csv(SUBMIT_PATH, header=True, index=False)
print("Done!")
print("Submission file saved to:",check_output(["ls", SUBMIT_PATH]).decode("utf8"))

Predicted Batch 1000/2290, ETC: 91.08523726463318 seconds
Predicted Batch 2000/2290, ETC: 17.796544790267944 seconds
Done!
Submission file saved to: lstm_submission_1.csv



The test set preprocessing and predictions are done in batches simply because the data does not fit in my PC's memory.
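The batch boundaries used in those loops can also be produced by a small generator, which avoids the manual counter and never yields an empty batch. This is a minimal sketch; `batch_slices` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
def batch_slices(n_rows, batch_size):
    """Yield (start, stop) index pairs that cover range(n_rows) without empty batches."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)


slices = list(batch_slices(n_rows=10, batch_size=4))
print(slices)  # [(0, 4), (4, 8), (8, 10)]
```

Each `(start, stop)` pair could then be used as `df_test.iloc[start:stop]` for preprocessing and as `predictions[start:stop]` when storing the model output, so the last, shorter batch needs no special handling.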

If there is a flaw in my code or you have any comments, I will be glad to hear them!
Thanks for your time!