# Quora Question Pairs

The objective of this competition is to decide whether a pair of questions has the same meaning, which helps Quora avoid duplicate questions. Submissions are evaluated using log loss, defined as

$$ {\rm Log\_Loss} = -\frac{1}{n} \sum_{i=1}^n \left(y_i \log \hat{y}_i + (1-y_i)\log( 1-\hat{y}_i) \right) $$

Here $y_i \in \{0, 1\}$ is the true label indicating whether pair $i$ is a duplicate, $\hat{y}_i$ is the predicted probability that the pair is a duplicate, and $n$ is the number of pairs. A perfect score is zero.
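As a rough illustration of the formula above, the log loss can be computed directly with NumPy. This is only a sketch: the function name `log_loss` and the clipping constant `eps` are assumptions made here (clipping avoids taking $\log 0$), and the labels and probabilities are toy values, not competition data.

```python
import numpy as np

def log_loss(y_true, y_pred, eps=1e-15):
    """Mean binary log loss; eps is an assumed clipping constant to avoid log(0)."""
    y_true = np.asarray(y_true, dtype=float)
    # Keep predicted probabilities strictly inside (0, 1)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Toy labels and predictions (illustrative only)
y_true = [0, 1, 1, 0]
print(log_loss(y_true, [0.1, 0.9, 0.8, 0.3]))  # fairly confident, mostly right
print(log_loss(y_true, [0.5, 0.5, 0.5, 0.5]))  # uninformative guess, equals ln(2)
```

Predicting $0.5$ for every pair gives a loss of $\ln 2 \approx 0.693$, which is a useful reference point: a model is only informative if it scores below that.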

## Setup
Import useful packages for data analysis, plotting, and modeling.

In [2]:
import pandas as pd
import numpy as np
#import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(context = 'notebook', font_scale = 1.5, rc={'figure.figsize':(10, 6)})

from IPython.display import display #for displaying multiple outputs from a single cell


import sklearn
import sklearn.neural_network
import sklearn.model_selection
import sklearn.decomposition

## Import Data

Import the training data.

In [36]:
train_data = pd.read_csv('train.csv')

In [37]:
train_data.head()

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0


## Basic Features
We extract some basic features from each question (its length in characters and its number of words) that can serve as inputs to a statistical model.

In [47]:
#Length of the question (in characters)
train_data['q1Len'] = train_data.question1.apply(lambda x: len(str(x)))
train_data['q2Len'] = train_data.question2.apply(lambda x: len(str(x)))
#Number of words in the question
train_data['q1NumWords'] = train_data.question1.apply(lambda x: len(str(x).split()))
train_data['q2NumWords'] = train_data.question2.apply(lambda x: len(str(x).split()))

In [48]:
train_data.head()

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate,q1Len,q2Len,q1NumWords,q2NumWords
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0,66,57,14,12
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0,51,88,8,13
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0,73,59,14,10
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0,50,65,11,9
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0,76,39,13,7


In [46]:
#Sanity check: split() with no arguments breaks on whitespace, so this counts words
len('hello world'.split())

2
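One detail of the features above is the `str(x)` wrapper, which presumably guards against non-string entries such as missing values. The sketch below uses a small made-up frame (`toy` is a hypothetical name, not part of the notebook) to show what that implies: a missing question becomes the string `'nan'`, so it is counted as 3 characters and 1 word. Filling missing values with an empty string first is one possible alternative, shown for comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one real question and one missing value
toy = pd.DataFrame({'question1': ['How do I learn Python?', np.nan]})

# Same approach as the features above: str(NaN) becomes the string 'nan'
toy['q1Len'] = toy.question1.apply(lambda x: len(str(x)))
toy['q1NumWords'] = toy.question1.apply(lambda x: len(str(x).split()))

# Alternative (an assumption, not the notebook's method): treat missing as empty
filled = toy.question1.fillna('')
toy['q1LenFilled'] = filled.str.len()
toy['q1NumWordsFilled'] = filled.str.split().str.len()

print(toy)
```

Which convention is preferable depends on how the model should treat missing questions; the point is only that the two choices produce different feature values for those rows.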

## Create Submission

In [62]:
sample_pred = pd.read_csv('./sample_submission.csv')
sample_pred.head()

Unnamed: 0,test_id,is_duplicate
0,0,1
1,1,1
2,2,1
3,3,1
4,4,1


In [None]:
#preds.to_csv('file_name.csv', index=False)
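The sample file suggests that a submission needs a `test_id` column and an `is_duplicate` column holding the predicted probability. A minimal sketch of building such a frame follows. It assumes a constant placeholder probability (`constant_prob` is a hypothetical name and value, not a fitted model) and uses a small stand-in for `sample_pred` so that it runs without the competition files.

```python
import pandas as pd

# Stand-in for pd.read_csv('./sample_submission.csv'); first five rows only
sample_pred = pd.DataFrame({'test_id': [0, 1, 2, 3, 4],
                            'is_duplicate': [1, 1, 1, 1, 1]})

# Placeholder prediction: the same probability for every pair (assumption)
constant_prob = 0.5

preds = pd.DataFrame({'test_id': sample_pred['test_id'],
                      'is_duplicate': constant_prob})

# to_csv without a path returns the CSV text; pass a file name to write to disk
csv_text = preds.to_csv(index=False)
print(csv_text)
```

Passing `index=False` keeps the row index out of the file, so the output contains only the two expected columns.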