# Quora Question Pair

The question origins from the kaggle competition: https://www.kaggle.com/c/quora-question-pairs
The target of the project is that given a pair of questions, we need to determine if they are duplicated questions or not. The two questions are defined to be duplicated if they can be solved by the same solution. The difficulties of the questions are: 
1. Two sentences are not guaranteed to be duplicated even if one sentence is almost a copy of another one.
2. The same question can be presented as different forms. 
3. Even though the two questions are not semantically equivalent, they may still be solved by the same answer. 
Because of the first and the second difficulty, we need to find ways to identify if the two sentences are semantically equivalent. However, the third difficulty determines that the above is not enough and we need to feed all features into the ML model for prediction. 

Reference: 
1. https://www.aclweb.org/anthology/K15-1013
2. https://cs.stanford.edu/~quocle/paragraph_vector.pdf

In [1]:
from keras.preprocessing.text import Tokenizer
import pandas as pd
import numpy as np
import gc

Using TensorFlow backend.


# Read and concatenate train and test data

In [2]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

In [None]:
train['qid1'].astype(int, copy=False)
train['qid2'].astype(int, copy=False)

In [4]:
train_size = train.shape[0]
test_size = test.shape[0]
print("train_size: %d" % train_size) 
print("test_size: %d" % test_size) 

train_size: 404290
test_size: 2345796


In [5]:
data = pd.concat([train,test],ignore_index=True).fillna("")
data.head()

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  """Entry point for launching an IPython kernel.


Unnamed: 0,id,is_duplicate,qid1,qid2,question1,question2,test_id
0,0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,
1,1,0,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,
2,2,0,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,
3,3,0,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,
4,4,0,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,


# Tokenize the document

In [6]:
raw_texts = pd.concat([data['question1'][:train_size], data['question2'][:train_size], data['question1'][train_size:], data['question2'][train_size:]], ignore_index=True)
raw_texts = raw_texts.astype(str).str.lower()

In [7]:
text_train_size = 2 * train_size
text_size = test.shape[0]

In [8]:
tk = Tokenizer(lower = True)
tk.fit_on_texts(raw_texts)

In [9]:
train_tk_q1 = tk.texts_to_sequences(raw_texts[:train_size])
train_tk_q2 = tk.texts_to_sequences(raw_texts[train_size:text_train_size])
test_tk_q1 = tk.texts_to_sequences(raw_texts[text_train_size:text_train_size+text_size])
test_tk_q2 = tk.texts_to_sequences(raw_texts[text_train_size+text_size:])

In [10]:
tk_q1 = train_tk_q1 + test_tk_q1
tk_q2 = train_tk_q2 + test_tk_q2

# Similarity features (copy)
Functions that generate features help to determine if one sentence is a copy of another sentence. 

In [11]:
# number of shared words
def num_codes(tk):
    ans = []
    for row in range(len(tk)):
        ans.append(len(tk[row]))
    return ans



# number of unique words     
def num_unique_codes(tk):
    ans = []
    for row in range(len(tk)):
        ans.append(len(set(tk[row])))
    return list(ans)


# number of letters
def num_letters(tk, tkcode):
    ans = []
    index_word = tk.index_word
    for row in range(len(tkcode)):
        num = 0
        for windex in tkcode[row]:
            num += len(index_word[windex])
        ans.append(num)
    return ans


# Jaccard similarity
def Jaccard(tk_q1, tk_q2):
    ans = []
    for row in range(len(tk_q1)):
        q1 = tk_q1[row]
        q2 = tk_q2[row]
        inter1 = len([c for c in q1 if c in q2])
        inter2 = len([c for c in q2 if c in q1])
        inter = inter1+inter2
        union = len(q1)+len(q2)
        if union == 0:
            ans.append(1)
        else:
            ans.append(inter/union*100)
    return ans


# Jaccard similarity with 2-shingles
def Jaccard_2_shingles(tk_q1, tk_q2):
    ans = []
    for row in range(len(tk_q1)):
        q1 = tk_q1[row]
        q2 = tk_q2[row]
        q1_2_shingles = []
        for i in range(len(q1)-1):
            q1_2_shingles.append((q1[i], q1[i+1]))
        q2_2_shingles = []
        for i in range(len(q2)-1):
            q2_2_shingles.append((q2[i], q2[i+1]))
        inter1 = len([c for c in q1_2_shingles if c in q2_2_shingles])
        inter2 = len([c for c in q2_2_shingles if c in q1_2_shingles])
        inter = inter1+inter2
        union = len(q1_2_shingles)+len(q2_2_shingles)
        if union == 0:
            ans.append(1)
        else:
            ans.append(inter/union*100)
    return ans

In [12]:
data['question1_num_codes'] = num_codes(tk_q1)
data['question2_num_codes'] = num_codes(tk_q2)
#data['question1_num_words'] = num_words(data['question1'])
#data['question2_num_words'] = num_words(data['question2'])
data['question1_num_unique_words'] = num_unique_words(tk_q1)
data['question2_num_unique_words'] = num_unique_words(tk_q2)
data['question1_words_vs_unique'] = data['question1_num_unique_words'] / data['question1_num_words'] * 100
data['question2_words_vs_unique'] = data['question2_num_unique_words'] / data['question2_num_words'] * 100
data['question1_num_letter1'] = num_letters(tk, tk_q1)
data['question2_num_letter1'] = num_letters(tk, tk_q2)
data['q1_q2_nw_ratio'] = data['question1_num_words'] / data['question2_num_words'] * 100
data['q1_q2_nw_unique_ratio'] = data['question1_num_unique_words'] / data['question2_num_unique_words'] * 100
data['Jaccard'] = Jaccard(tk_q1, tk_q2)
data['Jaccard_2_singles'] = Jaccard_2_shingles(tk_q1, tk_q2)

# Sentence embedding
Our focus for the project is on the creation of sentence embedding and generate similarity score from the vector presentation of the sentences. Two methods have been studied and used: 
1. https://cs.stanford.edu/~quocle/paragraph_vector.pdf. doc2vec library can be used to generate sentence embedding
2. https://www.aclweb.org/anthology/K15-1013. Implemented by Shu Hong.

## doc2vec
Since the method is unsupervised learning and test data is much more than training data, we use train data and test data for learning the sentence embedding. We can expect further improvement if we introduce all the questions from the quora platform. 

In [13]:
docs = {}
qid = 537934
for index, drow in data.iterrows():
    if index < train_size:
        docs[tuple(drow['question1'].lower().split())] = drow['qid1']
        docs[tuple(drow['question2'].lower().split())] = drow['qid2']
    else:
        doc = tuple(drow['question1'].lower().split())
        if doc in docs:
            data.at[index, 'qid1'] = docs[doc]
        else:
            data.at[index, 'qid1'] = qid
            docs[doc] = qid
            qid += 1
        doc = tuple(drow['question2'].lower().split())
        if doc in docs:
            data.at[index, 'qid2'] = docs[doc]
        else:
            data.at[index, 'qid2'] = qid
            docs[doc] = qid
            qid += 1

In [14]:
from gensim import utils
from gensim.models.doc2vec import LabeledSentence
from gensim.models.doc2vec import TaggedDocument
from gensim.models import Doc2Vec

In [15]:
# tags is used to distinguish among sentences
# however, tags or integer index can be used to get sentence vectors 
labeled_questions = []
for index, drow in data.iterrows():
    labeled_questions.append(TaggedDocument(str(drow['question1']).lower().split(), [str(drow['qid1'])]))
    labeled_questions.append(TaggedDocument(str(drow['question2']).lower().split(), [str(drow['qid2'])]))

In [16]:
model = Doc2Vec(dm = 1, min_count=1, window=10, vector_size=150, negative=10 ,dbow_words=1,hs=1,workers=-1)
model.build_vocab(labeled_questions)

In [17]:
for epoch in range(50):
    model.train(labeled_questions,epochs=model.epochs,total_examples=model.corpus_count)
    print("Epoch #{} is complete.".format(epoch+1))

Epoch #1 is complete.
Epoch #2 is complete.
Epoch #3 is complete.
Epoch #4 is complete.
Epoch #5 is complete.
Epoch #6 is complete.
Epoch #7 is complete.
Epoch #8 is complete.
Epoch #9 is complete.
Epoch #10 is complete.
Epoch #11 is complete.
Epoch #12 is complete.
Epoch #13 is complete.
Epoch #14 is complete.
Epoch #15 is complete.
Epoch #16 is complete.
Epoch #17 is complete.
Epoch #18 is complete.
Epoch #19 is complete.
Epoch #20 is complete.
Epoch #21 is complete.
Epoch #22 is complete.
Epoch #23 is complete.
Epoch #24 is complete.
Epoch #25 is complete.
Epoch #26 is complete.
Epoch #27 is complete.
Epoch #28 is complete.
Epoch #29 is complete.
Epoch #30 is complete.
Epoch #31 is complete.
Epoch #32 is complete.
Epoch #33 is complete.
Epoch #34 is complete.
Epoch #35 is complete.
Epoch #36 is complete.
Epoch #37 is complete.
Epoch #38 is complete.
Epoch #39 is complete.
Epoch #40 is complete.
Epoch #41 is complete.
Epoch #42 is complete.
Epoch #43 is complete.
Epoch #44 is complet

In [18]:
scores = []
for index, drow in data.iterrows():
    s1 = str(drow['question1'])
    s2 = str(drow['question2'])
    if len(s1) == 0 or len(s2) == 0:
        if len(s1) == len(s2):
            score = 1
        else:
            score = 0
    else:
        score = model.docvecs.similarity(str(drow['qid1']), str(drow['qid2']))
    scores.append(score)
data['sent_similarity'] = scores

## Get word embedding and sentence embedding features
The word embedding and the sentence embedding are also used as features. 

In [19]:
def make_feature_vec(doc, model, word_set):
    feature_vec = np.zeros((model.vector_size,))
    nwords = 0
    for word in doc:
        if word in word_set:
            nwords += 1
            feature_vec = np.add(feature_vec, model[word])
    if nwords > 0:
        feature_vec = np.divide(feature_vec, nwords)
    return feature_vec


def get_avg_feature_vecs(docs, model, word_set):
    counter = 0
    feature_vecs = np.zeros((len(docs), model.vector_size))
    for doc in docs:
        feature_vecs[counter] = make_feature_vec(doc, model, word_set)
        counter += 1
    return feature_vecs


def get_doc_vecs(data, tag_name, model):
    counter = 0
    question = ''
    if tag_name == 'qid1':
        question = 'question1'
    else:
        question = 'question2'
    docs = data[question]
    doc_vecs = np.zeros((len(docs), model.vector_size))
    for index, doc in enumerate(docs):
        doc_vecs[counter] = model.docvecs[str(data[tag_name][index])]
        counter += 1
    return doc_vecs

In [20]:
word_set = model.wv.vocab
word_emb_q1 = get_avg_feature_vecs(data['question1'], model, word_set)
word_emb_q2 = get_avg_feature_vecs(data['question2'], model, word_set)
word_emb = np.hstack([word_emb_q1, word_emb_q2])

In [21]:
doc_emb_q1 = get_doc_vecs(data, 'qid1', model)
doc_emb_q2 = get_doc_vecs(data, 'qid2', model)
doc_emb = np.hstack([doc_emb_q1, doc_emb_q2])

In [22]:
'''
features = ['question1_num_codes', 'question1_num_words', 'question1_num_unique_words', 
            'question1_words_vs_unique', 'question1_num_letter1', 'question2_num_codes', 
            'question2_num_words', 'question2_num_unique_words', 
            'question2_words_vs_unique', 'question2_num_letter1', 
            'q1_q2_nw_ratio', 'q1_q2_nw_unique_ratio', 'Jaccard', 'Jaccard_2_singles','sent_similarity']
'''
features = ['question1_num_codes', 'question1_num_words', 'question1_num_unique_words', 
            'question1_words_vs_unique', 'question1_num_letter1', 'question2_num_codes', 
            'question2_num_words', 'question2_num_unique_words', 
            'question2_words_vs_unique', 'question2_num_letter1', 
            'q1_q2_nw_ratio', 'q1_q2_nw_unique_ratio', 'Jaccard', 'Jaccard_2_singles','sent_similarity']

In [23]:
data = np.hstack([data[features].values,word_emb,doc_emb])

# restore the data

In [2]:
#import pickle

In [20]:
#a = len(data)//2
#pickle.dump(data1[:a], open('data_03.pkl', 'wb'), protocol=4)
#pickle.dump(data1[a:], open('data_04.pkl', 'wb'), protocol=4)

In [3]:
#data_1 = pickle.load(open('data_03.pkl', 'rb'))
#data_2 = pickle.load(open('data_04.pkl', 'rb'))

In [4]:
#data = np.vstack([data_1, data_2])
#gc.collect()

0

Our implementation of the supervised learning based sentence embedding is in Word2Vec_Dup_Detection.ipynb file. The method is implemented by Shu Hong. We only use the similarities score as a feature. 

In [5]:
train1 = pd.read_csv('sh_similarity_train.csv')
test1 = pd.read_csv('sh_similarity.csv')

In [6]:
sim = pd.concat([train1, test1], ignore_index=True)

In [7]:
def getNum(s):
    return s.replace('[[', '').replace(']]', '')


sim = sim['similarity'].apply(getNum)

In [13]:
data1 = np.zeros((data.shape[0], data.shape[1]+1))
data1[:,:-1] = data
data1[:,-1] = sim

In [9]:
train_x = data1[:train_size]
train_y = train['is_duplicate']
test_x = data1[train_size:]

# Tree models
lightGBM and XGboost are used for prediction. lightGBM can have better result and can obtain a logloss of 0.150058. The score can be further improved if we do model tuning and stacking. 

In [10]:
import lightgbm as lgb
import xgboost as xgb

In [11]:
lgb_params = dict()
lgb_params['objective'] = 'binary'
lgb_params['learning_rate'] = 0.1
lgb_params['num_leaves'] = 63
lgb_params['max_depth'] = 15
lgb_params['min_gain_to_split '] = 0.1
#lgb_params['subsample'] = 0.7
lgb_params['colsample_bytree'] = 0.7
lgb_params['min_sum_hessian_in_leaf'] = 0.001
#lgb_params["boosting"] = 'dart'
#lgb_params['lambda_l1'] = 0.01 
lgb_params['seed']=42

In [12]:
lgb_cv = lgb.cv(lgb_params,
                lgb.Dataset(train_x,
                            label=train_y
                            ),
                num_boost_round=20000,
                nfold=5,
                stratified=True,
                shuffle=True,
                early_stopping_rounds=50,
                seed=42,
                verbose_eval=500)

In [13]:
best_score = min(lgb_cv['binary_logloss-mean'])
best_iteration = len(lgb_cv['binary_logloss-mean'])
print ('Best iteration: %d, best score: %f' % (best_iteration, best_score))

Best iteration: 91, best score: 0.150058


In [15]:
xgb_params = {}
xgb_params["objective"] = "binary:logistic"
xgb_params["eta"] = 0.02
xgb_params["seed"] = 1234
xgb_params["max_depth"] = 15
xgb_params["metric"] = 'logloss'
xgb_params['silent'] = 1

In [16]:
xgb_cv = xgb.cv(xgb_params,
    xgb.DMatrix(train_x,
    label=train_y
    ),
    num_boost_round=100,
    nfold=5,
    metrics='logloss',
    early_stopping_rounds=50,
    verbose_eval=10)
best_score = xgb_cv['test-%s-mean' % ('logloss')].min()
best_iteration = len(xgb_cv)
print (', best_score: %f, best_iteration: %d' % (best_score, best_iteration))

[0]	train-logloss:0.676052+3.32e-05	test-logloss:0.676879+2.51762e-05
[10]	train-logloss:0.535931+0.000138905	test-logloss:0.544239+0.000205403
[20]	train-logloss:0.435226+0.000240436	test-logloss:0.450451+0.000340138
[30]	train-logloss:0.360004+0.000293356	test-logloss:0.381689+0.000421217
[40]	train-logloss:0.302277+0.000281019	test-logloss:0.330028+0.000504163
[50]	train-logloss:0.256867+0.000272261	test-logloss:0.290615+0.000591839
[60]	train-logloss:0.220614+0.000316419	test-logloss:0.260243+0.000618138
[70]	train-logloss:0.191413+0.000383474	test-logloss:0.236593+0.000662395
[80]	train-logloss:0.167605+0.000440586	test-logloss:0.218111+0.000717293
[90]	train-logloss:0.147908+0.000330174	test-logloss:0.203617+0.00077036
[99]	train-logloss:0.132921+0.000436975	test-logloss:0.193238+0.000812791
, best_score: 0.193238, best_iteration: 100


# Prediction for test data

In [28]:
from sklearn.model_selection import train_test_split


x_train, x_dev, y_train, y_dev = train_test_split(train_x, train_y)
train_data = lgb.Dataset(x_train, label=y_train, categorical_feature=[0,1])
test_data = lgb.Dataset(x_dev, label=y_dev, categorical_feature=[0,1])

In [29]:
model = lgb.train(lgb_params, train_data, valid_sets=test_data, num_boost_round=20000, early_stopping_rounds=50, verbose_eval=100)



Training until validation scores don't improve for 50 rounds.
[100]	valid_0's binary_logloss: 0.152377
Early stopping, best iteration is:
[89]	valid_0's binary_logloss: 0.152092


In [31]:
y_pred = model.predict(test_x)

In [33]:
sub = pd.DataFrame()
sub['test_id'] = test['test_id']
sub['is_duplicate'] = y_pred
sub.to_csv('submit.csv', index=False)