# Identifying Duplicate Questions

Over 100 million people visit Quora every month, so it's no surprise that many of them ask similar (or identical) questions. Multiple questions with the same intent force people to spend extra time searching for the best answer, and lead members to answer several versions of the same question. Quora uses a random forest model to identify duplicate questions, which gives a better experience to active seekers and writers and offers more value to both groups in the long term.
Follow the steps outlined below to build the appropriate classifier model.


Steps:
- Download data
- Exploration
- Cleaning
- Feature Engineering
- Modeling

By the end of this project you should have **a presentation that describes the model you built** and its **performance**. 


In [1]:
import pandas as pd

In [6]:
df = pd.read_csv("./train.csv") #.sample(1000, random_state=0)
df

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0
...,...,...,...,...,...,...
404285,404285,433578,379845,How many keywords are there in the Racket prog...,How many keywords are there in PERL Programmin...,0
404286,404286,18840,155606,Do you believe there is life after death?,Is it true that there is life after death?,1
404287,404287,537928,537929,What is one coin?,What's this coin?,0
404288,404288,537930,537931,What is the approx annual cost of living while...,I am having little hairfall problem but I want...,0


#### Note
There is no designated test.csv file. The train.csv file is the entire dataset. Part of the data in the train.csv file should be set aside to act as the final testing data.

# Exploration

In [7]:
df['is_duplicate'].value_counts()

0    255027
1    149263
Name: is_duplicate, dtype: int64
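
The counts above show that the two classes are not balanced. As a rough sanity check, the sketch below computes the class shares and the accuracy of a trivial model that always predicts the majority class. The counts are copied by hand from the output above; the variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 255027, 1: 149263}
total = sum(counts.values())

# Share of each class in the full dataset.
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the most common class.
majority_label = max(counts, key=counts.get)
baseline_accuracy = shares[majority_label]

print(f"total pairs: {total}")
print(f"duplicate share: {shares[1]:.3f}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any model built later should beat this baseline accuracy to be worth presenting.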

In [9]:
# No exact duplicates
df[df['question1'] == df['question2']]

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate


In [10]:
# Distribution of word frequency

In [None]:
# Topic modelling

# *Train test split*

In [353]:
# Split each class separately so that train and test keep the same class ratio

from sklearn.model_selection import train_test_split
df_train_0, df_test_0 = train_test_split(df[df['is_duplicate'] == 0], train_size=0.8, random_state=0)
df_train_1, df_test_1 = train_test_split(df[df['is_duplicate'] == 1], train_size=0.8, random_state=0)

print(df_train_0.shape)
print(df_test_0.shape)
print(df_train_1.shape)
print(df_test_1.shape)

(204021, 6)
(51006, 6)
(119410, 6)
(29853, 6)


In [359]:
df_train = pd.concat([df_train_0, df_train_1]).sample(frac=1)
df_test = pd.concat([df_test_0, df_test_1]).sample(frac=1)
y_train = df_train['is_duplicate']
y_test = df_test['is_duplicate']

# Make sure no rows got dropped
print(len(df_train)+len(df_test)==len(df))
df_test.head()

True


Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
257849,257849,129883,373253,I enjoy tending to my self inflicted injuries....,Am I doing it wrong? Is there an alternative w...,0
140751,140751,38090,223615,"How do you say ""I'm sorry"" in Korean? Is there...","In Korean, how do you say ""work""?",0
277047,277047,396077,396078,What are advantages of projector head lights o...,I'm looking for a new (or lightly used) car. I...,0
387603,387603,118351,46690,What's Linux?,What is the use of Linux?,1
74206,74206,127210,127211,Which design softwares does the aerospace indu...,Is the right to police to pull us in chowky if...,0


# Cleaning

- Tokenization
- Stopwords cleaning
- Removing punctuation
- Normalizing
- Stemming
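
The cleaning steps listed above can be sketched with the standard library alone, which avoids downloading NLTK corpora just to experiment. This is a minimal illustration, not the pipeline used in this notebook: the stop-word list is a tiny made-up sample, `crude_stem` is a toy stand-in for a real stemmer or lemmatizer, and the steps run in a slightly different order from the list (normalizing first).

```python
import string

# Tiny illustrative stop-word list (an assumption); a real run would use a
# fuller list such as the NLTK English stop words.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "in", "do", "does", "i", "my"}

# Question words carry intent, so this sketch never removes them.
KEEP_WORDS = {"who", "what", "where", "when", "why", "how"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def crude_stem(word):
    """Toy stemmer: strip a trailing plural 's' from words longer than 3 letters."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def clean_question(text):
    """Normalize, strip punctuation, tokenize, drop stop words, and stem."""
    text = text.lower()                      # normalizing
    text = text.translate(_PUNCT_TABLE)      # removing punctuation
    tokens = text.split()                    # tokenization on whitespace
    tokens = [t for t in tokens if t in KEEP_WORDS or t not in STOP_WORDS]
    return [crude_stem(t) for t in tokens]   # stemming (toy version)


print(clean_question("How do I increase my Instagram followers?"))
```

Keeping the question words is a design choice: "why" and "how" versions of an otherwise identical question often have different intent.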

# Feature Engineering

- tf-idf
- word2vec
- word count
- number of the same words in both questions
- ....
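
To make a few of the features above concrete, here is a dependency-free sketch of word counts, shared-word counts, and a tf-idf cosine similarity for one question pair. The function names, the toy corpus, and the smoothed idf formula are assumptions chosen for illustration; a library vectorizer would normally replace the hand-written tf-idf, and word2vec is not covered here.

```python
import math
from collections import Counter


def shared_word_count(tokens1, tokens2):
    """Number of distinct words that appear in both questions."""
    return len(set(tokens1) & set(tokens2))


def idf_weights(corpus):
    """Smoothed inverse document frequency for every word in a tokenized corpus."""
    n_docs = len(corpus)
    doc_freq = Counter(word for doc in corpus for word in set(doc))
    # The +1 terms are a smoothing choice (an assumption) that avoids zero weights.
    return {w: math.log((1 + n_docs) / (1 + n)) + 1 for w, n in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Sparse tf-idf vector as a dict: raw term count times idf weight."""
    counts = Counter(tokens)
    return {w: c * idf.get(w, 0.0) for w, c in counts.items()}


def cosine_similarity(vec1, vec2):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * vec2.get(word, 0.0) for word, weight in vec1.items())
    norm1 = math.sqrt(sum(v * v for v in vec1.values()))
    norm2 = math.sqrt(sum(v * v for v in vec2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


# Toy tokenized questions, made up for illustration.
q1 = ["what", "is", "linux"]
q2 = ["what", "is", "the", "use", "of", "linux"]
q3 = ["which", "fish", "would", "survive", "in", "salt", "water"]
idf = idf_weights([q1, q2, q3])

features = {
    "q1_word_count": len(q1),
    "q2_word_count": len(q2),
    "shared_words": shared_word_count(q1, q2),
    "tfidf_cosine": cosine_similarity(tfidf_vector(q1, idf), tfidf_vector(q2, idf)),
}
print(features)
```

In practice the idf weights would be fitted on the training questions only, so that no information from the held-out test rows leaks into the features.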

## *Forest Models*

In [138]:
import timeit

In [21]:
# Take a subsample to test word processing functions
df.loc[0:2,['question1', 'question2']]

Unnamed: 0,question1,question2
0,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...
2,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...


In [222]:
# 2022-11-17 9:05 Attempt with a single call of .apply; no good
# from nltk.tokenize import word_tokenize
# import string
# from nltk.corpus import stopwords
# from nltk.stem.porter import PorterStemmer
# from nltk.stem import WordNetLemmatizer

# def process_doc(doc):
    
#     values = doc
#     # print(type(doc))
#     print(values,'\n\n')
#     # # remove special characters
#     # doc = ''.join([char for char in doc if not char in string.punctuation])

#     # # Split text into single words (also gets rid of extra white spaces)
#     # words = word_tokenize(doc)

#     # # Convert to lower case
#     # words = [word.lower() for word in words]
    
#     # # # Remove stop words aside from the 5 why's + how
#     # stop_words = set(stopwords.words('english')) - set(['who', 'what', 'where', 'when', 'how', 'why'])
#     # words = [w for w in words if not w in stop_words]

#     # # Stem
#     # # porter = PorterStemmer()
#     # # words = [porter.stem(word) for word in words]
#     # wnl = WordNetLemmatizer()
#     # words = [wnl.lemmatize(word) for word in words]

#     # return words

# # %%timeit
# def process_pairs(df):
#     df = df.apply(lambda x: process_doc(x))

#     # return df


# process_pairs(df.loc[0:1])

In [None]:
# # 2022-11-17 8:49 multiple .apply
# from nltk.tokenize import word_tokenize
# import string
# from nltk.corpus import stopwords
# from nltk.stem.porter import PorterStemmer
# from nltk.stem import WordNetLemmatizer

# def process_doc(doc):

#     # remove special characters
#     doc = ''.join([char for char in doc if not char in string.punctuation])

#     # Split text into single words (also gets rid of extra white spaces)
#     words = word_tokenize(doc)

#     # Convert to lower case
#     words = [word.lower() for word in words]
    
#     # # Remove stop words aside from the 5 why's + how
#     stop_words = set(stopwords.words('english')) - set(['who', 'what', 'where', 'when', 'how', 'why'])
#     words = [w for w in words if not w in stop_words]

#     # Stem
#     # porter = PorterStemmer()
#     # words = [porter.stem(word) for word in words]
#     wnl = WordNetLemmatizer()
#     words = [wnl.lemmatize(word) for word in words]

#     return words

# # %%timeit
# def process_pairs(df):
#     tokens1 = df['question1'].apply(lambda x: process_doc(x))
#     tokens2 = df['question2'].apply(lambda x: process_doc(x))

#     df2 = pd.concat([df['is_duplicate'], 
#         tokens1,
#         tokens2
#         ], axis=1)

#     df2['length_diff'] = abs(len(tokens1) - len(tokens2))

#     return df2


# process_pairs(df.loc[0:2])

Unnamed: 0,is_duplicate,question1,question2,length_diff
0,0,"[what, step, step, guide, invest, share, marke...","[what, step, step, guide, invest, share, market]",0
1,0,"[what, story, kohinoor, kohinoor, diamond]","[what, would, happen, indian, government, stol...",0
2,0,"[how, increase, speed, internet, connection, u...","[how, internet, speed, increased, hacking, dns]",0


In [283]:

from nltk.tokenize import word_tokenize
import string
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

In [293]:
set(stopwords.words('english')) - set(['who', 'what', 'where', 'when', 'how', 'why'])

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 're',
 's'

In [322]:
# Forest models
    # Data prep using .itertuples()
# %%time

def process_doc(doc):

    # remove special characters
    doc = ''.join(char for char in doc if char not in string.punctuation)

    # Split text into single words (also gets rid of extra white spaces)
    words = word_tokenize(doc)

    # Convert to lower case
    words = [word.lower() for word in words]

    # Lemmatize (a Porter stemmer is left here as a commented-out alternative)
    # porter = PorterStemmer()
    # words = [porter.stem(word) for word in words]
    wnl = WordNetLemmatizer()
    words = [wnl.lemmatize(word) for word in words]

    return words

# feature engineering
def process_pairs(df):
    """
    Perform feature engineering on pairs of questions for forest models.
    """

    df2 = pd.DataFrame()
    for index, q1, q2 in df[['question1', 'question2']].itertuples():
        # Preprocess
        tokens1 = process_doc(q1)
        tokens2 = process_doc(q2)
        # print(tokens1)
        # print(tokens2)
        # print()

        # Get number of words for q1 and q2 and the ratio of these values
        df2.loc[index,'Q1 length'] = len(tokens1)
        df2.loc[index,'Q2 length'] = len(tokens2)
        # The extra 1 in max() avoids division by zero when both questions are empty
        df2.loc[index,'length ratio'] = min(len(tokens1), len(tokens2)) / max(len(tokens1), len(tokens2), 1)

        # Number of common words between q1 and q2
        common_words = set(tokens1) & set(tokens2)
        df2.loc[index,'N common words'] = len(common_words)

        # Common words between q1 and q2 as a fraction of the longest question in the pair
        df2.loc[index,'common words percentage'] = len(common_words) / max(len(tokens1), len(tokens2), 1)

        # Same last word (0 if either question has no tokens)
        df2.loc[index,'same last word'] = int(bool(tokens1) and bool(tokens2) and tokens1[-1] == tokens2[-1])

        # Number of times 'not' appears in q1 (q2 is not counted)
        df2.loc[index,'not_count1'] = sum(word=='not' for word in tokens1)
    
        # Remove stop words, keeping the question words (who, what, where, when, how, why)
        stop_words = set(stopwords.words('english')) - set(['who', 'what', 'where', 'when', 'how', 'why'])
        words1 = [w for w in tokens1 if w not in stop_words]
        words2 = [w for w in tokens2 if w not in stop_words]

        # Number of common words between q1 and q2 with stop words removed
        common_words_nonstop = set(words1) & set(words2)
        df2.loc[index,'N common non-stop words'] = len(common_words_nonstop)
        # The extra 1 in max() avoids division by zero when both questions contain only stop words
        df2.loc[index,'common non-stop words percentage'] = len(common_words_nonstop) / max(
            len(words1), len(words2), 1)
        
        # print(words2,'\n')
        
    # length difference
    df2['Q1 Q1 length difference'] = abs(df2['Q1 length'] - df2['Q2 length'])

    return df2
        

process_pairs(df.head())

Unnamed: 0,Q1 length,Q2 length,length ratio,N common words,common words percentage,same last word,not_count1,N common non-stop words,common non-stop words percentage,Q1 Q1 length difference
0,14.0,12.0,0.857143,11.0,0.785714,0.0,0.0,6.0,0.75,2.0
1,8.0,13.0,0.615385,4.0,0.307692,0.0,0.0,3.0,0.3,5.0
2,14.0,10.0,0.714286,4.0,0.285714,0.0,0.0,3.0,0.428571,4.0
3,11.0,9.0,0.818182,0.0,0.0,0.0,0.0,0.0,0.0,2.0
4,13.0,7.0,0.538462,4.0,0.307692,0.0,0.0,2.0,0.2,6.0


In [321]:
process_pairs(df.loc[40:50])

['why', 'do', 'slav', 'squat']
['will', 'squat', 'make', 'my', 'leg', 'thicker']

['when', 'can', 'i', 'expect', 'my', 'cognizant', 'confirmation', 'mail']
['when', 'can', 'i', 'expect', 'cognizant', 'confirmation', 'mail']

['can', 'i', 'make', '50000', 'a', 'month', 'by', 'day', 'trading']
['can', 'i', 'make', '30000', 'a', 'month', 'by', 'day', 'trading']

['is', 'being', 'a', 'good', 'kid', 'and', 'not', 'being', 'a', 'rebel', 'worth', 'it', 'in', 'the', 'long', 'run']
['is', 'being', 'bored', 'good', 'for', 'a', 'kid']

['what', 'university', 'doe', 'rexnord', 'recruit', 'new', 'grad', 'from', 'what', 'major', 'are', 'they', 'looking', 'for']
['what', 'university', 'doe', 'bg', 'food', 'recruit', 'new', 'grad', 'from', 'what', 'major', 'are', 'they', 'looking', 'for']

['what', 'is', 'the', 'quickest', 'way', 'to', 'increase', 'instagram', 'follower']
['how', 'can', 'we', 'increase', 'our', 'number', 'of', 'instagram', 'follower']

['how', 'did', 'darth', 'vader', 'fought', 'darth

Unnamed: 0,Q1 length,Q2 length,length ratio,N common words,common words percentage,same last word,not_count1,N common non-stop words,common non-stop words percentage,Q1 Q1 length difference
40,4.0,6.0,0.666667,1.0,0.166667,0.0,0.0,1.0,0.25,2.0
41,8.0,7.0,0.875,7.0,0.875,1.0,0.0,5.0,1.0,1.0
42,9.0,9.0,1.0,8.0,0.888889,1.0,0.0,4.0,0.8,0.0
43,16.0,7.0,0.4375,5.0,0.3125,0.0,1.0,2.0,0.333333,9.0
44,14.0,15.0,0.933333,12.0,0.8,1.0,0.0,8.0,0.727273,1.0
45,9.0,9.0,1.0,3.0,0.333333,1.0,0.0,3.0,0.5,0.0
46,11.0,9.0,0.818182,0.0,0.0,0.0,0.0,0.0,0.0,2.0
47,24.0,12.0,0.5,3.0,0.125,0.0,0.0,0.0,0.0,12.0
48,13.0,10.0,0.769231,8.0,0.615385,1.0,0.0,4.0,0.666667,3.0
49,5.0,4.0,0.8,3.0,0.6,1.0,0.0,3.0,1.0,1.0


In [328]:
from sklearn.base import BaseEstimator
from sklearn.linear_model import LogisticRegression

class ClfSwitcher(BaseEstimator):
    """Wrapper around BaseEstimator that lets GridSearchCV swap the underlying model."""

    def __init__(self, estimator=LogisticRegression()):
        self.estimator = estimator  # receives an estimator (model) as an input

    def fit(self, X, y=None, **kwargs):
        self.estimator.fit(X, y, **kwargs)
        return self

    def predict(self, X, y=None):
        return self.estimator.predict(X)

    def predict_proba(self, X):
        return self.estimator.predict_proba(X)

    def score(self, X, y):
        return self.estimator.score(X, y)

In [350]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb

forest_pipe = Pipeline([
    ('process_pairs', FunctionTransformer(process_pairs)),
    ('model', ClfSwitcher())
])
# forest_pipe.fit_transform(df.head()) # This line works before the modeling step
grid_params = [
    {
        'model__estimator': [LogisticRegression(random_state=0)],
        'model__estimator__class_weight': [None, 'balanced']
        },
    {
        'model__estimator': [SVC(random_state=0)],
        'model__estimator__class_weight': [None, 'balanced']
        },
    {
        'model__estimator': [xgb.XGBClassifier(random_state=0)],
        'model__estimator__n_estimators': [100, 150, 200]
        },
    {
        'model__estimator': [RandomForestClassifier(random_state=0)],
        'model__estimator__n_estimators': [100, 150, 200]
        },
    ]
gs = GridSearchCV(forest_pipe, grid_params, scoring='accuracy')
gs.fit(df.head(20), df.head(20)['is_duplicate'])
print(gs.best_estimator_)
print(gs.best_params_)
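y_pred = gs.predict(df.head(20)) # 2022-11-17 11:56 works for predictions; these rows were also used for fitting, so agreement with the labels does not measure generalization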

Pipeline(steps=[('process_pairs',
                 FunctionTransformer(func=<function process_pairs at 0x00000291C62F2F70>)),
                ('model',
                 ClfSwitcher(estimator=RandomForestClassifier(n_estimators=150,
                                                              random_state=0)))])
{'model__estimator': RandomForestClassifier(n_estimators=150, random_state=0), 'model__estimator__n_estimators': 150}


In [344]:
print(list(df.head(20)['is_duplicate']))
print(y_pred)

[0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0]
[0 0 0 0 0 1 0 1 0 0 0 1 1 1 0 1 1 0 1 0]


# Modeling

Different modeling techniques can be used:

- logistic regression
- XGBoost
- LSTMs
- etc
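
Whichever technique is chosen, the final presentation needs performance numbers, and with unbalanced classes accuracy alone can be misleading. The sketch below computes accuracy, precision, recall, and F1 for binary labels by hand. The function name and the example labels are hypothetical and made up purely for illustration; they are not results from this notebook.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = duplicate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels and predictions, invented for this example.
example_true = [0, 0, 0, 0, 0, 1, 0, 1, 0, 0]
example_pred = [0, 0, 1, 0, 0, 1, 0, 0, 0, 0]
metrics = classification_metrics(example_true, example_pred)
print(metrics)
```

These metrics should be reported on the held-out test rows rather than on the rows used for fitting.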