## Identifying Duplicate Questions

Over 100 million people visit Quora every month, so it's no surprise that many people ask similar (or the same) questions. Various questions with the same intent can cause people to spend extra time searching for the best answer to their question, and results in members answering multiple versions of the same question. Quora uses random forest to identify duplicated questions to provide a better experience to active seekers and writers, and offer more value to both of these groups in the long term.
Follow the steps outlined below to build the appropriate classifier model. 


Steps:
- Download data
- Exploration
- Cleaning
- Feature Engineering
- Modeling

By the end of this project you should have **a presentation that describes the model you built** and its **performance**. 


In [20]:
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [21]:
df = pd.read_csv("/Users/tativalentine/Documents/GitHub/mini-project-V-1/train.csv")

#### Note
There is no designated test.csv file. The train.csv file is the entire dataset. Part of the data in the train.csv file should be set aside to act as the final testing data.

In [22]:
import numpy as np

# Load the data into a pandas DataFrame
train_data = pd.read_csv("train.csv")

# Shuffle the data to randomly sample data for testing
train_data = train_data.sample(frac=1).reset_index(drop=True)

# Calculate the number of rows to use for testing
test_size = int(train_data.shape[0] * 0.2)

# Split the data into testing and training sets
test_data = train_data.iloc[:test_size,:]
train_data = train_data.iloc[test_size:,:]


### Exploration

In [23]:
# Get the shape of the data
print("Number of rows: ", train_data.shape[0])
print("Number of columns: ", train_data.shape[1])

Number of rows:  323432
Number of columns:  6


In [24]:
train_data.head()

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
80858,2835,5624,5625,What are some cultural faux pas in space?,What are some cultural faux pas in China?,0
80859,44539,79917,79918,What is the fastest way to get rid of a canker...,What's the best way to get rid of a canker sore?,1
80860,352293,481213,481214,Can you explain global warming in layman terms?,What's the explanation of the connection betwe...,1
80861,170408,263482,263483,How do I approach my parents about love?,How do I convince my parents for my love?,0
80862,362044,491931,491932,"If the Star Trek Universe were real, could the...",Star Trek (creative franchise): Why are the br...,0


In [25]:
# Get a list of the column names
print("Column names: ", train_data.columns)

Column names:  Index(['id', 'qid1', 'qid2', 'question1', 'question2', 'is_duplicate'], dtype='object')


In [26]:
# Get the first 5 rows of the data
print("First 5 rows: \n", train_data.head())

First 5 rows: 
            id    qid1    qid2  \
80858    2835    5624    5625   
80859   44539   79917   79918   
80860  352293  481213  481214   
80861  170408  263482  263483   
80862  362044  491931  491932   

                                               question1  \
80858          What are some cultural faux pas in space?   
80859  What is the fastest way to get rid of a canker...   
80860    Can you explain global warming in layman terms?   
80861           How do I approach my parents about love?   
80862  If the Star Trek Universe were real, could the...   

                                               question2  is_duplicate  
80858          What are some cultural faux pas in China?             0  
80859   What's the best way to get rid of a canker sore?             1  
80860  What's the explanation of the connection betwe...             1  
80861          How do I convince my parents for my love?             0  
80862  Star Trek (creative franchise): Why are the br...   

In [27]:
# Get summary statistics for each column
print("Summary statistics: \n", train_data.describe())

Summary statistics: 
                   id           qid1           qid2   is_duplicate
count  323432.000000  323432.000000  323432.000000  323432.000000
mean   202004.845266  217092.798724  220718.269976       0.369926
std    116661.324803  157631.435772  159790.502162       0.482785
min         0.000000       1.000000       2.000000       0.000000
25%    100998.750000   74426.750000   74605.000000       0.000000
50%    201959.500000  191980.000000  196783.500000       0.000000
75%    302937.750000  346253.250000  354212.500000       1.000000
max    404288.000000  537930.000000  537931.000000       1.000000


In [28]:
# Check for missing values
print("Missing values: \n", train_data.isna().sum())

Missing values: 
 id              0
qid1            0
qid2            0
question1       1
question2       2
is_duplicate    0
dtype: int64


### Cleaning

- Tokenization
- Stopwords cleaning
- Removing punctuation
- Normalizing
- Stemming

In [29]:
# Clean and preprocess the data

df['question1'] = df['question1'].astype(str)
df['question2'] = df['question2'].astype(str)

stop_words = set(stopwords.words('english'))
ps = PorterStemmer()
def preprocess_text(text):
    tokens = word_tokenize(text)
    tokens = [word.lower() for word in tokens if word.isalpha()]
    tokens = [word for word in tokens if word not in stop_words]
    tokens = [ps.stem(word) for word in tokens]
    return " ".join(tokens)

df['question1'] = df['question1'].apply(preprocess_text)
df['question2'] = df['question2'].apply(preprocess_text)

# Convert text data into numerical representations
cv = CountVectorizer()
X = cv.fit_transform(df['question1'] + df['question2'])
y = df['is_duplicate']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


### Feature Engineering

- tf-idf
- word2vec
- word count
- number of the same words in both questions
- ....

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec
import numpy as np
import string

In [31]:

# Preprocess the text
def preprocess_text(text):
    # Remove punctuation and lowercase
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = text.lower()
    return text

# Preprocess both questions
df['question1'] = df['question1'].apply(preprocess_text)
df['question2'] = df['question2'].apply(preprocess_text)


In [32]:
# Tf-idf
tfidf = TfidfVectorizer()
question1_tfidf = tfidf.fit_transform(df['question1'])
question2_tfidf = tfidf.transform(df['question2'])


In [33]:
from gensim.models import Word2Vec

questions = df['question1'].tolist() + df['question2'].tolist()
model = Word2Vec(questions, vector_size=150, window=10, min_count=2, workers=10)

question1_word2vec = []
question2_word2vec = []
for question in df['question1']:
    word_vectors = [model.wv[word] for word in question.split() if word in model.wv.vocab]
    if word_vectors:
        question1_word2vec.append(np.mean(word_vectors, axis=0))
    else:
        question1_word2vec.append(np.zeros(150))
        
for question in df['question2']:
    word_vectors = [model.wv[word] for word in question.split() if word in model.wv.vocab]
    if word_vectors:
        question2_word2vec.append(np.mean(word_vectors, axis=0))
    else:
        question2_word2vec.append(np.zeros(150))
        
question1_word2vec = np.array(question1_word2vec)
question2_word2vec = np.array(question2_word2vec)


AttributeError: The vocab attribute was removed from KeyedVector in Gensim 4.0.0.
Use KeyedVector's .key_to_index dict, .index_to_key list, and methods .get_vecattr(key, attr) and .set_vecattr(key, attr, new_val) instead.
See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4

In [34]:
# Word count
question1_word_count = df['question1'].apply(lambda x: len(x.split()))
question2_word_count = df['question2'].apply(lambda x: len(x.split()))

In [35]:
# Number of the same words in both questions
def same_words(question1, question2):
    return len(set(question1.split()).intersection(question2.split()))

### Modeling

Different modeling techniques can be used:

- logistic regression
- XGBoost
- LSTMs
- etc

# Logistic Regression


In [36]:
# Logistic Regression

# Train a logistic regression model
clf = LogisticRegression()
clf.fit(X_train, y_train)

# Evaluate the model on the testing data
y_pred = clf.predict(X_test)
print("Accuracy: ", accuracy_score(y_test, y_pred))
print("Precision: ", precision_score(y_test, y_pred))
print("Recall: ", recall_score(y_test, y_pred))
print("F1 Score: ", f1_score(y_test, y_pred))

# Save the best-performing model
import joblib
joblib.dump(clf, 'duplicate_question_classifier.joblib')

Accuracy:  0.7625095847040491
Precision:  0.723071863180398
Recall:  0.5851938113458659
F1 Score:  0.6468673568840912


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


['duplicate_question_classifier.joblib']

# XGBOOST

In [37]:
# Train the XGBoost classifier
model = xgb.XGBClassifier()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model's accuracy
acc = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (acc * 100.0))

Accuracy: 71.84%


# LSTM

In [51]:
pip install tensorflow


[31mERROR: Could not find a version that satisfies the requirement tensorflow (from versions: none)[0m[31m
[0m[31mERROR: No matching distribution found for tensorflow[0m[31m
[0mNote: you may need to restart the kernel to use updated packages.
