## Identifying Duplicate Questions

Over 100 million people visit *Quora* every month, so it's no surprise that many people ask similar (or the same) questions. Various questions with the same intent can cause people to spend extra time searching for the best answer to their question, and results in members answering multiple versions of the same question. *Quora* uses random forest to identify duplicated questions to provide a better experience to active seekers and writers, and offer more value to both of these groups in the long term.

Follow the steps outlined below to build the appropriate classifier model. 

Steps:
- Download or load the data
- Exploration (EDA)
- Cleaning
- Feature Engineering
- Modeling

By the end of this project you should have **a presentation that describes the model you built** and its **performance**. 

### Step 1: Loading the data

In [1]:
# Import basic libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 

In [2]:
# Load "Quora" dataset
df = pd.read_csv("/Users/rafaelaqueiroz/Mini-Project-V/train.csv")
df.head(10)

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0
5,5,11,12,Astrology: I am a Capricorn Sun Cap moon and c...,"I'm a triple Capricorn (Sun, Moon and ascendan...",1
6,6,13,14,Should I buy tiago?,What keeps childern active and far from phone ...,0
7,7,15,16,How can I be a good geologist?,What should I do to be a great geologist?,1
8,8,17,18,When do you use シ instead of し?,"When do you use ""&"" instead of ""and""?",0
9,9,19,20,Motorola (company): Can I hack my Charter Moto...,How do I hack Motorola DCX3400 for free internet?,0


### Note
There is no designated test.csv file. The train.csv file is the entire dataset. Part of the data in the train.csv file should be set aside to act as the final testing data.

### Step 2: Exploration

In [3]:
# Check the number of rows and columns of the dataset
df.shape

(404290, 6)

The dataset has 6 columns and 404,290 rows in total.

In [4]:
# Check for missing or NaN values
df.isnull() # It returns a boolean df indicating the presence or absence of missing values in each cell of it

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,False,False,False
3,False,False,False,False,False,False
4,False,False,False,False,False,False
...,...,...,...,...,...,...
404285,False,False,False,False,False,False
404286,False,False,False,False,False,False
404287,False,False,False,False,False,False
404288,False,False,False,False,False,False


In [5]:
# Check for the total missing or NaN values
df.isnull().sum() # It sums up the number of 'True' values for each column

id              0
qid1            0
qid2            0
question1       1
question2       2
is_duplicate    0
dtype: int64

In [6]:
# Investigate the 3 rows that have the missing values
df[df.isnull().any(axis=1)]

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
105780,105780,174363,174364,How can I develop android app?,,0
201841,201841,303951,174364,How can I create an Android app?,,0
363362,363362,493340,493341,,My Chinese name is Haichao Yu. What English na...,0


Even though the first 2 rows have questions that are not marked as duplicates, they are semantically similar and produce the same pragmatic meaning. This is an important point to consider when building a model to identify duplicate questions, as it highlights the need to use NLP techniques that can capture the semantic similarity between questions, beyond just comparing the text directly.

Also, as we only have 3 NaN values, we decided to drop them as this represents a small percentage of duplicates in comparison with our total data. However, if this would represent a bigger percentage, the act of dropping rows would be needed to consider with caution as this could introduce bias to our dataset and affect the overall performance of our model. 

In [7]:
# Drop the missing values from the dataset
df.dropna(subset=['question1', 'question2'], inplace=True)

In [8]:
# Check for duplicate values before moving forward to our cleaning and preprocessing data
df['is_duplicate'].value_counts() # It counts de number of duplicate and non-duplicate questions in the dataset

0    255024
1    149263
Name: is_duplicate, dtype: int64

In [62]:
# # Let's visualize the data to see its proportionality
# # Count the number of occurrences of each category
# counts = df.groupby(['question1', 'question2']).size().reset_index(name='count')

# # Calculate the total count
# total_count = counts['count'].sum()

# # Calculate the percentage of each category
# counts['percentage'] = (counts['count'] / total_count) * 100

# # Display the percentages in a pie chart
# plt.pie(counts['percentage'], labels=counts['question1'] + ' ' + counts['question2'], autopct='%1.1f%%')
# plt.title('Categorical Features')
# plt.show()

There are 255,024 non-duplicate questions and 149,263 duplicate questions in the dataframe. This is an important factor to be considered before making our model as the count of non-duplicate and duplicate questions can help us to understand the nature of the dataset and, likewise, to give us some information on how to approach a model to identify duplicate questions.

In our case, the number of duplicate questions (given by the Boolean 1) is much smaller than the number of non-duplicate questions. Then this could indicate that the dataset is imbalanced, which may affect the performance of the model afterwards. Thus, we might need to employ techniques like undersampling or oversampling to balance the dataset.

As this can pose a challenge when building a model - the model may be biased towards classifying questions as non-duplicates, leading to a poor performance on identifying duplicates -, we decided to address this balance with the use of undersampling.

In [10]:
# Select randomly a subset of non-duplicate questions to match the number of duplicate questions
duplicates = df[df['is_duplicate'] == 1] # Separate the duplicate and non-duplicate questions
non_duplicates = df[df['is_duplicate'] == 0]
num_duplicates = len(duplicates) # Get the number of duplicate questions
non_duplicates_sampled = non_duplicates.sample(num_duplicates) # Select a subset of non-duplicate questions

# Combine the sampled non-duplicate questions with the original duplicate questions
balanced_df = pd.concat([duplicates, non_duplicates_sampled], axis=0) 

# Shuffle the dataset to ensure that the duplicate and non-duplicate questions are mixed
balanced_df = balanced_df.sample(frac=1).reset_index(drop=True)
balanced_df

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,6730,13174,13175,What are shot narrative paragraphs? What are s...,What are some examples of a narrative paragraph?,0
1,177759,273176,273177,Which is a better web development framework: D...,Which is better and why: PHP frameworks( Code ...,0
2,28446,52751,52752,What lessons can we learn from Adolf Hitler?,What are the lessons we can learn from Adolf H...,1
3,27196,50540,50541,Why aren’t cats mentioned in the Bible?,Why aren't books being added to the Bible in m...,0
4,369220,499635,401188,"The Jews, Christians and Muslims all worship t...","Do Jews, Christians and Muslims all worship th...",1
...,...,...,...,...,...,...
298521,130598,209556,54732,What is the best business to start in small ci...,What is the best business to start in a villag...,1
298522,366980,497240,497241,Stanford Football: How is it different to play...,Stanford Football: What was it like to have Da...,0
298523,51396,91167,91168,How do you choose an unresponsive yoyo?,Which yoyo is André Boulay using in his tutori...,0
298524,59264,103827,103828,Which is the best song that one can listen to ...,What are some good songs to listen to when dep...,1


In [11]:
# Now, let's check if our dataset is more balanced
balanced_df['is_duplicate'].value_counts()

0    149263
1    149263
Name: is_duplicate, dtype: int64

### Step 3: Cleaning

- Removing punctuation
- Tokenization
- Cleaning stopwords
- Normalizing
- Stemming or Lemmitization

In order to clean (more) and preprocess the text data, we are going to apply different techniques, such as, converting it to lowercase, removing stop words and punctuation, and stemming or lemmatization of the the words.

#### Removing punctuation

In [12]:
# Import more libraries
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [13]:
# Define a function to remove punctuation in our "question1" and "question2" columns
def remove_punct(text):
    text = "".join([char for char in text if char not in string.punctuation])
    return text

balanced_df = balanced_df.assign( # Assigning the new 2 columns in the dataframe
    question1_cleaned=balanced_df["question1"].apply(remove_punct),
    question2_cleaned=balanced_df["question2"].apply(remove_punct)
)

balanced_df.head()

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate,question1_cleaned,question2_cleaned
0,6730,13174,13175,What are shot narrative paragraphs? What are s...,What are some examples of a narrative paragraph?,0,What are shot narrative paragraphs What are so...,What are some examples of a narrative paragraph
1,177759,273176,273177,Which is a better web development framework: D...,Which is better and why: PHP frameworks( Code ...,0,Which is a better web development framework Dj...,Which is better and why PHP frameworks Code ig...
2,28446,52751,52752,What lessons can we learn from Adolf Hitler?,What are the lessons we can learn from Adolf H...,1,What lessons can we learn from Adolf Hitler,What are the lessons we can learn from Adolf H...
3,27196,50540,50541,Why aren’t cats mentioned in the Bible?,Why aren't books being added to the Bible in m...,0,Why aren’t cats mentioned in the Bible,Why arent books being added to the Bible in mo...
4,369220,499635,401188,"The Jews, Christians and Muslims all worship t...","Do Jews, Christians and Muslims all worship th...",1,The Jews Christians and Muslims all worship th...,Do Jews Christians and Muslims all worship the...


In [14]:
# Drop the columns that are not going to be used anymore
balanced_df = balanced_df.drop(['id', 'qid1', 'qid2', 'question1', 'question2'], axis=1)
balanced_df.tail()

Unnamed: 0,is_duplicate,question1_cleaned,question2_cleaned
298521,1,What is the best business to start in small ci...,What is the best business to start in a villag...
298522,0,Stanford Football How is it different to play ...,Stanford Football What was it like to have Dav...
298523,0,How do you choose an unresponsive yoyo,Which yoyo is André Boulay using in his tutori...
298524,1,Which is the best song that one can listen to ...,What are some good songs to listen to when dep...
298525,1,What could be the effect of GST bill on Indian...,What is the new GST bill and how will it affec...


#### Tokenization and applying lower case

In [15]:
# Import regular expression library
import re

In [16]:
# Define a function to split our sentences into a list of words
def tokenize(text):
    tokens = text.split()
    return tokens

balanced_df['question_1_tokenized'] = balanced_df['question1_cleaned'].apply(lambda x: tokenize(x.lower()))
balanced_df['question_2_tokenized'] = balanced_df['question2_cleaned'].apply(lambda x: tokenize(x.lower()))
balanced_df.tail(2)

Unnamed: 0,is_duplicate,question1_cleaned,question2_cleaned,question_1_tokenized,question_2_tokenized
298524,1,Which is the best song that one can listen to ...,What are some good songs to listen to when dep...,"[which, is, the, best, song, that, one, can, l...","[what, are, some, good, songs, to, listen, to,..."
298525,1,What could be the effect of GST bill on Indian...,What is the new GST bill and how will it affec...,"[what, could, be, the, effect, of, gst, bill, ...","[what, is, the, new, gst, bill, and, how, will..."


In [17]:
# Drop the previous columns
balanced_df = balanced_df.drop(['question1_cleaned', 'question2_cleaned'], axis=1)
balanced_df.tail(2)

Unnamed: 0,is_duplicate,question_1_tokenized,question_2_tokenized
298524,1,"[which, is, the, best, song, that, one, can, l...","[what, are, some, good, songs, to, listen, to,..."
298525,1,"[what, could, be, the, effect, of, gst, bill, ...","[what, is, the, new, gst, bill, and, how, will..."


#### Removing the stopwords

In [18]:
# Import the NLTK package
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# View the stopwords
ENGstopwords = stopwords.words('english')
ENGstopwords[0:25]

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/rafaelaqueiroz/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers']

In [19]:
# Define a function to remove all stopwords
def remove_stopwords(tokenized_text):    
    text = [word for word in tokenized_text if word not in ENGstopwords]
    return text

balanced_df['question1_non_stop'] = balanced_df['question_1_tokenized'].apply(lambda x: remove_stopwords(x))
balanced_df['question2_non_stop'] = balanced_df['question_2_tokenized'].apply(lambda x: remove_stopwords(x))
balanced_df.tail(2)

Unnamed: 0,is_duplicate,question_1_tokenized,question_2_tokenized,question1_non_stop,question2_non_stop
298524,1,"[which, is, the, best, song, that, one, can, l...","[what, are, some, good, songs, to, listen, to,...","[best, song, one, listen, sad, depressed]","[good, songs, listen, depressed]"
298525,1,"[what, could, be, the, effect, of, gst, bill, ...","[what, is, the, new, gst, bill, and, how, will...","[could, effect, gst, bill, indian, economy]","[new, gst, bill, affect, us]"


In [20]:
# Drop the previous columns
balanced_df = balanced_df.drop(['question_1_tokenized', 'question_2_tokenized'], axis=1)
balanced_df.tail(2)

Unnamed: 0,is_duplicate,question1_non_stop,question2_non_stop
298524,1,"[best, song, one, listen, sad, depressed]","[good, songs, listen, depressed]"
298525,1,"[could, effect, gst, bill, indian, economy]","[new, gst, bill, affect, us]"


### Stemming

In [36]:
# Import modules 
nltk.download('punkt')
from nltk.stem import PorterStemmer

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/rafaelaqueiroz/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [37]:
# Create a stemmer object
ps = PorterStemmer() # It will remove pre-defined stems

# Define a function to stem the text
def stemmed_text(words):
    stemmed_words = []
    for word in words:
        stemmed_words.append(ps.stem(word))
    return stemmed_words

In [38]:
# Call up the function that applies stemmed_text to our columns in the data frame
balanced_df['question1_stem'] = balanced_df['question1_non_stop'].apply(lambda x: stemmed_text(x))
balanced_df['question2_stem'] = balanced_df['question2_non_stop'].apply(lambda x: stemmed_text(x))
balanced_df.tail(2)

Unnamed: 0,is_duplicate,question1_non_stop,question2_non_stop,question1_stem,question2_stem
298524,1,"[best, song, one, listen, sad, depressed]","[good, songs, listen, depressed]","[best, song, one, listen, sad, depress]","[good, song, listen, depress]"
298525,1,"[could, effect, gst, bill, indian, economy]","[new, gst, bill, affect, us]","[could, effect, gst, bill, indian, economi]","[new, gst, bill, affect, us]"


Before dropping the previous columns, we are going to make a comparative of the modification results with the lemmitization technique.

#### Lemmatization

In [41]:
# Importing some modules 
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/rafaelaqueiroz/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/rafaelaqueiroz/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [43]:
# Create a lemmatizer object
lemmatizer = WordNetLemmatizer() 

# Create a function to apply lemmitization into our words
def lemmitization(words):
    lemmitization_words = []
    for word in words:
        lemmitization_list = ' '.join([lemmatizer.lemmatize(word) for word in words])
        lemmitization_words.append(lemmitization_list)
    return lemmitization_words 

In [45]:
# Call up the function that applies lemmitization to our columns in the data frame
balanced_df['question1_lemm'] = balanced_df['question1_non_stop'].apply(lambda x: stemmed_text(x))
balanced_df['question2_lemm'] = balanced_df['question2_non_stop'].apply(lambda x: stemmed_text(x))
balanced_df.tail(5)

Unnamed: 0,is_duplicate,question1_non_stop,question2_non_stop,question1_stem,question2_stem,question1_lemm,question2_lemm
298521,1,"[best, business, start, small, cities]","[best, business, start, village, small, city]","[best, busi, start, small, citi]","[best, busi, start, villag, small, citi]","[best, busi, start, small, citi]","[best, busi, start, villag, small, citi]"
298522,0,"[stanford, football, different, play, david, s...","[stanford, football, like, david, shaw, teammate]","[stanford, footbal, differ, play, david, shaw,...","[stanford, footbal, like, david, shaw, teammat]","[stanford, footbal, differ, play, david, shaw,...","[stanford, footbal, like, david, shaw, teammat]"
298523,0,"[choose, unresponsive, yoyo]","[yoyo, andré, boulay, using, tutorial, videos]","[choos, unrespons, yoyo]","[yoyo, andré, boulay, use, tutori, video]","[choos, unrespons, yoyo]","[yoyo, andré, boulay, use, tutori, video]"
298524,1,"[best, song, one, listen, sad, depressed]","[good, songs, listen, depressed]","[best, song, one, listen, sad, depress]","[good, song, listen, depress]","[best, song, one, listen, sad, depress]","[good, song, listen, depress]"
298525,1,"[could, effect, gst, bill, indian, economy]","[new, gst, bill, affect, us]","[could, effect, gst, bill, indian, economi]","[new, gst, bill, affect, us]","[could, effect, gst, bill, indian, economi]","[new, gst, bill, affect, us]"


In [64]:
# Drop previous columns and staying with the lemmatization technique
balanced_df = balanced_df.drop(['question1_non_stop', 'question2_non_stop', 'question1_stem', 'question2_stem'], axis=1)
balanced_df.tail(2)

Unnamed: 0,is_duplicate,question1_lemm,question2_lemm
298524,1,"[best, song, one, listen, sad, depress]","[good, song, listen, depress]"
298525,1,"[could, effect, gst, bill, indian, economi]","[new, gst, bill, affect, us]"


### Step 4: Feature Engineering

- tf-idf
- word2vec
- word count
- number of the same words in both questions
- ....

At this step, we are going to extract relevant features from the preprocessed text data. Some useful features for this task might include the length of the questions, the number of shared words between questions, and the cosine similarity between their vector representations.

Also, before starting off this step, we are creating a *document term matrix* to help us vectorize the words.

In [57]:
# Import some libraries
from sklearn.feature_extraction.text import CountVectorizer # This is for Bag-of-Words application
vect = CountVectorizer()

In [58]:
# Create a function to vectorize the words 
def create_doc_term_matrix(text, vectorizer):
    doc_term_matrix = vectorizer.fit_transform(text)
    return pd.DataFrame(doc_term_matrix.toarray(), columns = vectorizer.get_feature_names_out()) # It returns a df with the results of the vectorizer

In [61]:
# Call up the vectorizer function to apply in our df
vect = CountVectorizer()
text = pd.concat([balanced_df['question1_lemm'], balanced_df['question2_lemm']]).tolist()
create_doc_term_matrix(text, vect)

AttributeError: 'list' object has no attribute 'lower'

#### Text Frequency - Inverse Document Frequency (TF-IDF)

In [None]:
# # Finally, let's visualize the distribution of the dataset
# fig, ax = plt.subplots(figsize=(10, 5))

# # Create a scatter plot of question1 vs. question2, with "is_duplicate" as the color
# scatter = ax.scatter(balanced_df["question1"].count(), balanced_df["question2"].count(), c=balanced_df["is_duplicate"])

# # Add a colorbar to show the mapping between color and "is_duplicate" values
# cbar = plt.colorbar(scatter)
# cbar.ax.set_ylabel("is_duplicate", rotation=270)

# # Set labels and title for the plot
# ax.set_xlabel("Question 1")
# ax.set_ylabel("Question 2")
# ax.set_title("Question 1 vs. Question 2, with is_duplicate as color")

# plt.show() # show the plot

### Modeling

Different modeling techniques can be used:

- logistic regression
- XGBoost
- LSTMs
- etc

Choose a machine learning algorithm that is appropriate for this task, such as logistic regression, decision tree, random forest, or support vector machine. Split the data into training and testing sets, and train the model on the training data.

### Model Evaluation

Evaluate the model: Evaluate the performance of the model on the testing data using metrics such as accuracy, precision, recall, and F1-score. You can also use techniques such as cross-validation to get a more accurate estimate of the model's performance.

Fine-tune the model: If the model's performance is not satisfactory, fine-tune it by adjusting the hyperparameters or trying different algorithms. You can also try using deep learning techniques such as neural networks or convolutional neural networks to improve the model's performance.

### Model Deployment

Deploy the model: Once you are satisfied with the model's performance, deploy it to automatically identify and label duplicate questions on Quora.