## Identifying Duplicate Questions

Over 100 million people visit Quora every month, so it's no surprise that many people ask similar (or identical) questions. Multiple questions with the same intent force readers to spend extra time searching for the best answer, and force writers to answer several versions of the same question. Quora uses a random forest model to identify duplicate questions, which gives a better experience to active seekers and writers and offers more value to both groups in the long term.
Follow the steps outlined below to build the appropriate classifier model. 


Steps:
- Download data
- Exploration
- Cleaning
- Feature Engineering
- Modeling

By the end of this project you should have **a presentation that describes the model you built** and its **performance**. 


In [91]:
import pandas as pd
import numpy as np
pd.set_option('display.max_colwidth', None) # to display full text without truncations

In [2]:
df = pd.read_csv("data/train.csv")

#### Note
There is no designated test.csv file. The train.csv file is the entire dataset. Part of the data in the train.csv file should be set aside to act as the final testing data.
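
One common way to set that data aside is a stratified split, so the held-out part keeps the same share of duplicates as the full file. The sketch below uses scikit-learn's `train_test_split` on a small made-up frame; the toy rows, the 20% test size, and `random_state=42` are illustrative assumptions, not values from this project.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for train.csv (made-up rows, for illustration only)
toy_df = pd.DataFrame({
    "id": range(10),
    "is_duplicate": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
})

# stratify keeps the duplicate / non-duplicate ratio similar in both parts
train_part, test_part = train_test_split(
    toy_df,
    test_size=0.2,
    stratify=toy_df["is_duplicate"],
    random_state=42,
)

# the two parts must not share any rows
assert set(train_part.index).isdisjoint(test_part.index)
print(len(train_part), len(test_part))
```

Fixing `random_state` makes the split repeatable, which matters when the test set is meant to stay untouched until the final evaluation.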

In [3]:
# quick data overview
df.info()  # info() prints its summary directly and returns None, so no display() is needed
display(df.describe())
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 404290 entries, 0 to 404289
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   id            404290 non-null  int64 
 1   qid1          404290 non-null  int64 
 2   qid2          404290 non-null  int64 
 3   question1     404289 non-null  object
 4   question2     404288 non-null  object
 5   is_duplicate  404290 non-null  int64 
dtypes: int64(4), object(2)
memory usage: 18.5+ MB

| | id | qid1 | qid2 | is_duplicate |
|---|---|---|---|---|
| count | 404290.0 | 404290.0 | 404290.0 | 404290.0 |
| mean | 202144.5 | 217243.942418 | 220955.655337 | 0.369198 |
| std | 116708.614502 | 157751.700002 | 159903.182629 | 0.482588 |
| min | 0.0 | 1.0 | 2.0 | 0.0 |
| 25% | 101072.25 | 74437.5 | 74727.0 | 0.0 |
| 50% | 202144.5 | 192182.0 | 197052.0 | 0.0 |
| 75% | 303216.75 | 346573.5 | 354692.5 | 1.0 |
| max | 404289.0 | 537932.0 | 537933.0 | 1.0 |


| | id | qid1 | qid2 | question1 | question2 | is_duplicate |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | What is the step by step guide to invest in share market in india? | What is the step by step guide to invest in share market? | 0 |
| 1 | 1 | 3 | 4 | What is the story of Kohinoor (Koh-i-Noor) Diamond? | What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back? | 0 |
| 2 | 2 | 5 | 6 | How can I increase the speed of my internet connection while using a VPN? | How can Internet speed be increased by hacking through DNS? | 0 |
| 3 | 3 | 7 | 8 | Why am I mentally very lonely? How can I solve it? | Find the remainder when [math]23^{24}[/math] is divided by 24,23? | 0 |
| 4 | 4 | 9 | 10 | Which one dissolve in water quikly sugar, salt, methane and carbon di oxide? | Which fish would survive in salt water? | 0 |


### Exploration

**Checking for nulls and duplicates.** `info()` already shows 3 null values (1 in `question1`, 2 in `question2`). The affected rows are dropped, since they are a negligible fraction of the dataset.

In [4]:
df = df.dropna()
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 404287 entries, 0 to 404289
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   id            404287 non-null  int64 
 1   qid1          404287 non-null  int64 
 2   qid2          404287 non-null  int64 
 3   question1     404287 non-null  object
 4   question2     404287 non-null  object
 5   is_duplicate  404287 non-null  int64 
dtypes: int64(4), object(2)
memory usage: 21.6+ MB


In [5]:
# checking for duplicates 
df.duplicated().sum()

0
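
Note that `df.duplicated()` compares whole rows, and every row has a unique `id`, so this check cannot flag a question pair that appears twice. A minimal sketch of a pair-level check, assuming the goal is to catch the same two question IDs repeated in either order; the toy rows and the `pair_key` name are hypothetical:

```python
import pandas as pd

# Made-up rows; row 2 repeats the pair (1, 2) in reverse order
toy_df = pd.DataFrame({
    "id":   [0, 1, 2, 3],
    "qid1": [1, 3, 2, 5],
    "qid2": [2, 4, 1, 6],
})

# whole-row comparison finds nothing, because id differs on every row
print(toy_df.duplicated().sum())

# order-independent key: smaller qid first, larger qid second
qids = toy_df[["qid1", "qid2"]]
pair_key = qids.min(axis=1).astype(str) + "-" + qids.max(axis=1).astype(str)
repeated_pairs = pair_key.duplicated().sum()
print(repeated_pairs)
```

The same idea could be applied to the question text instead of the IDs if the IDs are not trusted.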

In [6]:
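# only the last expression in a cell is rendered, so only the non-duplicate sample appears below;
# wrap the first call in display() to show both samples
df[df['is_duplicate']==1].sample(n=5)
df[df['is_duplicate']==0].sample(n=5)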

| | id | qid1 | qid2 | question1 | question2 | is_duplicate |
|---|---|---|---|---|---|---|
| 238240 | 238240 | 349562 | 310684 | How secure/reliable are porn websites if you use your credit card? | Theft: How does someone use a stolen credit card? | 0 |
| 139736 | 139736 | 179185 | 126972 | How do the tourist attractions on the Scandinavian Highlands compare to attractions in Ukraine? | How do the tourist attractions on the Scandinavian Highlands compare to attractions in Bulgaria? | 0 |
| 266198 | 266198 | 99455 | 211706 | How do you spot a genius? | How do people with IQs of 60-80 think? | 0 |
| 204601 | 204601 | 307542 | 307543 | Are human beings made for monogamy or is cheating inevitable? | How does it feel to be cheated on by your partner? | 0 |
| 288005 | 288005 | 1761 | 45938 | What is the hardest thing(s) about raising children in Ukraine? | What is the hardest thing(s) about raising children in Russia? | 0 |


#### Creating test_df as a held-out test set

In [19]:
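# hold out 20% of the rows as the test set (pass random_state=<int> to make the split reproducible)
test_df = df.sample(frac=0.2)

# boolean mask of the rows that went into the test set; the rest is training data
is_test = df.index.isin(test_df.index)
train_df = df[~is_test]

# checking for intersection
set(train_df.index) & set(test_df.index)  # should be empty to pass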

set()


### Cleaning

- Tokenization
- Stopwords cleaning
- Removing punctuation
- Normalizing
- Stemming / lemmatization

In [80]:
import string

from nltk.corpus import stopwords
stopwords_eng = stopwords.words('english')

from nltk.stem import WordNetLemmatizer
from nltk import pos_tag, word_tokenize # to get other parts of speech

import spacy
nlp = spacy.load("en_core_web_sm")

In [49]:
len(train_df)

323430

In [89]:
# sample text for testing
text = train_df['question1'][:10]
text

0                         What is the step by step guide to invest in share market in india?
1                                        What is the story of Kohinoor (Koh-i-Noor) Diamond?
4               Which one dissolve in water quikly sugar, salt, methane and carbon di oxide?
5     Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?
6                                                                        Should I buy tiago?
7                                                             How can I be a good geologist?
8                                                            When do you use シ instead of し?
9                               Motorola (company): Can I hack my Charter Motorolla DCX3400?
10                                 Method to find separation of slits using fresnel biprism?
11                                               How do I read and find my YouTube comments?
Name: question1, dtype: object

In [92]:
def preprocessing(documents):
    """Lowercase, drop stopwords and punctuation, then lemmatize each document."""
    cleaned_documents = []
    for text in documents:

        # lower case
        text = text.lower()

        # stopword removal (done before punctuation removal, so a token like "me?" is kept)
        text = " ".join([word for word in text.split() if word not in stopwords_eng])

        # removing punctuation
        text = "".join([char for char in text if char not in string.punctuation])

        # parsing with spaCy
        doc = nlp(text)

        # lemmatization
        text = " ".join([token.lemma_ for token in doc])

        cleaned_documents.append(text)

    return np.array(cleaned_documents)


preprocessing(text)

array(['step step guide invest share market india',
       'story kohinoor kohinoor diamond',
       'one dissolve water quikly sugar salt methane carbon di oxide',
       'astrology capricorn sun cap moon cap risingwhat say -PRON-',
       'buy tiago', 'good geologist', 'use シ instead し',
       'motorola company hack charter motorolla dcx3400',
       'method find separation slit use fresnel biprism',
       'read find youtube comment'], dtype='<U60')
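
Two artifacts are visible in the output above: `risingwhat` (deleting the dots in `rising...what` glues two words together) and `-PRON-` (`me?` slips past the stopword filter because of the attached question mark and is then lemmatized as a pronoun placeholder). One possible fix, sketched below without spaCy or NLTK, is to replace punctuation with spaces first and filter stopwords afterwards. `TOY_STOPWORDS` and `clean_text` are hypothetical names, and the stopword set is a small hand-picked stand-in for NLTK's English list.

```python
import string

# Small hand-picked stopword set for illustration; the notebook uses NLTK's English list
TOY_STOPWORDS = {"i", "am", "a", "and", "what", "does", "that", "about", "me"}

# Map every punctuation character to a space instead of deleting it
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def clean_text(text):
    text = text.lower().translate(PUNCT_TO_SPACE)
    # split() with no argument also collapses the repeated spaces left by punctuation
    return " ".join(word for word in text.split() if word not in TOY_STOPWORDS)

cleaned = clean_text(
    "Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?"
)
print(cleaned)  # astrology capricorn sun cap moon cap rising say
```

The trade-off is that hyphenated words such as `Koh-i-Noor` are split as well, so whether this ordering is preferable depends on the data.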

### Feature Engineering

- tf-idf
- word2vec
- word count
- number of the same words in both questions
- ....
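
The simpler items in the list above (word counts and the number of words shared by both questions) need no external library. A minimal sketch, assuming the questions have already been cleaned into space-separated tokens; `pair_features` and the feature names are hypothetical, and the second example string is a hand-cleaned version of the matching `question2`:

```python
def pair_features(q1, q2):
    """Return simple count-based features for one pair of cleaned questions."""
    words1, words2 = q1.split(), q2.split()
    set1, set2 = set(words1), set(words2)
    common = set1 & set2
    union = set1 | set2
    return {
        "word_count_q1": len(words1),
        "word_count_q2": len(words2),
        "common_words": len(common),
        # share of distinct words that appear in both questions (Jaccard similarity)
        "jaccard": len(common) / len(union) if union else 0.0,
    }

features = pair_features(
    "step step guide invest share market india",
    "step step guide invest share market",
)
print(features)
```

Applied row by row (for example with `DataFrame.apply`), a function like this yields a small numeric feature table that can sit alongside the tf-idf features.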

In [94]:
# tf-idf
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(preprocessing(text))
type(X_train)


scipy.sparse._csr.csr_matrix

In [96]:
X_train.toarray().shape

(10, 48)

In [97]:
preprocessing(text).shape

(10,)
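
The cells above fit the vectorizer on a sample of `question1` only. For a pairwise task, one option is to fit a single vocabulary on both question columns and compare the two tf-idf vectors of each pair. A sketch under those assumptions, on three made-up cleaned pairs; it relies on `TfidfVectorizer` L2-normalizing rows by default, so the row-wise dot product equals the cosine similarity.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up cleaned question pairs, for illustration only
q1 = ["step guide invest share market india", "buy tiago", "good geologist"]
q2 = ["step guide invest share market", "read find youtube comment", "good geologist"]

# Fit one vocabulary on both columns so the vectors share the same feature space
vectorizer = TfidfVectorizer()
vectorizer.fit(q1 + q2)
X1 = vectorizer.transform(q1)
X2 = vectorizer.transform(q2)

# Rows are L2-normalized (norm="l2" is the default), so the row-wise dot product is the cosine
cosine = np.asarray(X1.multiply(X2).sum(axis=1)).ravel()
print(cosine.round(3))
```

The resulting one-value-per-pair similarity can be used as a single feature, which is far more compact than concatenating the two sparse matrices.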

### Modeling

Different modeling techniques can be used:

- logistic regression
- XGBoost
- LSTMs
- etc
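
As a starting point, the first option in the list can be sketched with scikit-learn. The feature matrix below is invented toy data (two columns standing in for a shared-word count and a tf-idf cosine similarity), so the fitted model itself is meaningless; the point is only the fit / predict / evaluate flow. In practice the metric would be computed on the held-out test set, not on the training rows.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# Invented toy features: [common_words, tfidf_cosine]; labels: 1 = duplicate
X = np.array([[5, 0.9], [4, 0.8], [6, 0.95], [0, 0.0], [1, 0.1], [0, 0.05]])
y = np.array([1, 1, 1, 0, 0, 0])

clf = LogisticRegression()
clf.fit(X, y)

# predict_proba returns one column per class; column 1 is P(is_duplicate = 1)
proba = clf.predict_proba(X)[:, 1]
print(clf.predict(X))
print(log_loss(y, proba))
```

Predicted probabilities (not just hard labels) are useful here because they allow the decision threshold to be tuned later.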

In [98]:
import gensim
import gensim.downloader as api

# text8 corpus: an iterable of documents, each a list of word tokens
dataset = api.load("text8")
data = [d for d in dataset]

def tagged_document(list_of_list_of_words):
    # Doc2Vec expects TaggedDocument objects; the document index serves as its tag
    for i, list_of_words in enumerate(list_of_list_of_words):
        yield gensim.models.doc2vec.TaggedDocument(list_of_words, [i])

data_for_training = list(tagged_document(data))
print(data_for_training[:1])

model = gensim.models.doc2vec.Doc2Vec(vector_size=40, min_count=2, epochs=30)
model.build_vocab(data_for_training)
model.train(data_for_training, total_examples=model.corpus_count, epochs=model.epochs)

# infer a 40-dimensional vector for an unseen list of tokens
print(model.infer_vector(['violent', 'means', 'to', 'destroy', 'the','organization']))


[TaggedDocument(words=['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working', 'class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english', 'revolution', 'and', 'the', 'sans', 'culottes', 'of', 'the', 'french', 'revolution', 'whilst', 'the', 'term', 'is', 'still', 'used', 'in', 'a', 'pejorative', 'way', 'to', 'describe', 'any', 'act', 'that', 'used', 'violent', 'means', 'to', 'destroy', 'the', 'organization', 'of', 'society', 'it', 'has', 'also', 'been', 'taken', 'up', 'as', 'a', 'positive', 'label', 'by', 'self', 'defined', 'anarchists', 'the', 'word', 'anarchism', 'is', 'derived', 'from', 'the', 'greek', 'without', 'archons', 'ruler', 'chief', 'king', 'anarchism', 'as', 'a', 'political', 'philosophy', 'is', 'the', 'belief', 'that', 'rulers', 'are', 'unnecessary', 'and', 'should', 'be', 'abolished', 'although', 'there', 'are', 'differing', 'interpretations', 'of', 'what', 'this', 'means', 'anarchism', 'also', 'refers'