<a href="https://colab.research.google.com/github/ucheokechukwu/NLP-Project/blob/master/mini_project_V.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Identifying Duplicate Questions

Over 100 million people visit Quora every month, so it's no surprise that many people ask similar (or identical) questions. Multiple questions with the same intent force seekers to spend extra time searching for the best answer and lead writers to answer several versions of the same question. Quora uses a random forest model to identify duplicate questions, which provides a better experience for active seekers and writers and offers more value to both groups in the long term.
Follow the steps outlined below to build the appropriate classifier model. 


Steps:
- Download data
- Exploration
- Cleaning
- Feature Engineering
- Modeling

By the end of this project you should have **a presentation that describes the model you built** and its **performance**. 


Different Models:
1. 'Old school': elaborate feature generation (word count, common words, etc.) or TF-IDF tokenization, with Naive Bayes (grid search)
2. Deep learning with LSTM and embeddings (2 types of layering)
3. Deep learning with a Keras sentence encoder (modifying only the output layer)

In [3]:
import pandas as pd
import numpy as np
pd.set_option('display.max_colwidth', None) # to display full text without truncations

In [6]:
df = pd.read_csv("data/train.csv")

#### Note
There is no designated test.csv file. The train.csv file is the entire dataset. Part of the data in the train.csv file should be set aside to act as the final testing data.

In [7]:
# quick data overview
display(df.info())
display(df.describe())
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 404290 entries, 0 to 404289
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   id            404290 non-null  int64 
 1   qid1          404290 non-null  int64 
 2   qid2          404290 non-null  int64 
 3   question1     404289 non-null  object
 4   question2     404288 non-null  object
 5   is_duplicate  404290 non-null  int64 
dtypes: int64(4), object(2)
memory usage: 18.5+ MB


None

Unnamed: 0,id,qid1,qid2,is_duplicate
count,404290.0,404290.0,404290.0,404290.0
mean,202144.5,217243.942418,220955.655337,0.369198
std,116708.614503,157751.700002,159903.182629,0.482588
min,0.0,1.0,2.0,0.0
25%,101072.25,74437.5,74727.0,0.0
50%,202144.5,192182.0,197052.0,0.0
75%,303216.75,346573.5,354692.5,1.0
max,404289.0,537932.0,537933.0,1.0


Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in share market in india?,What is the step by step guide to invest in share market?,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Diamond?,What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?,0
2,2,5,6,How can I increase the speed of my internet connection while using a VPN?,How can Internet speed be increased by hacking through DNS?,0
3,3,7,8,Why am I mentally very lonely? How can I solve it?,"Find the remainder when [math]23^{24}[/math] is divided by 24,23?",0
4,4,9,10,"Which one dissolve in water quikly sugar, salt, methane and carbon di oxide?",Which fish would survive in salt water?,0


### Exploration

**Checking for Nulls and Duplicates**. `info()` already shows 3 null values (1 in `question1`, 2 in `question2`). The affected rows are dropped because they are a negligible fraction of the dataset.

In [8]:
df = df.dropna()
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 404287 entries, 0 to 404289
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   id            404287 non-null  int64 
 1   qid1          404287 non-null  int64 
 2   qid2          404287 non-null  int64 
 3   question1     404287 non-null  object
 4   question2     404287 non-null  object
 5   is_duplicate  404287 non-null  int64 
dtypes: int64(4), object(2)
memory usage: 21.6+ MB


In [9]:
# checking for duplicates 
df.duplicated().sum()

0

In [10]:
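# Note: a notebook cell only renders its last expression, so only the
# non-duplicate sample (is_duplicate == 0) appears in the output below.
df[df['is_duplicate']==1].sample(n=5)
df[df['is_duplicate']==0].sample(n=5)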

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
95164,95164,158777,158778,What is the best way to become an actuary analyst?,What is the best way to become an actuary?,0
92717,92717,6241,155172,How do I overcome my inferiority complex ?,How can I overcome my severe inferiority complex?,0
371094,371094,501700,501701,How can a tuna and Apple diet affect your health negatively?,How healthy is the apple and tuna diet?,0
220410,220410,327501,327502,Are girls attracted to fat guys?,Do girls mind dating fat guys?,0
382825,382825,514713,82649,Do employees at Superior Industries International have a good work-life balance? Does this differ across positions and departments?,Do employees at Reading International have a good work-life balance? Does this differ across positions and departments?,0


#### Creating test_df as a separate hold-out set

In [11]:
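test_df = df.sample(frac=0.2)  # random 20% hold-out; pass random_state to make it repeatable

# boolean mask marking the rows that went into the test set
# (named in_test to avoid shadowing the built-in filter)
in_test = df.index.isin(test_df.index)
train_df = df[~in_test]

# checking for intersection
set(train_df.index) & set(test_df.index)  # should be empty to pass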

set()
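
The hold-out above is drawn with a plain random sample. As a hedged alternative, scikit-learn's `train_test_split` can stratify on `is_duplicate` so both splits keep roughly the same share of duplicates, and a fixed `random_state` makes the split repeatable. The tiny `toy_df` frame and the seed value are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real dataframe (10 rows, 40% duplicates).
toy_df = pd.DataFrame({
    "question1": [f"q1 text {i}" for i in range(10)],
    "question2": [f"q2 text {i}" for i in range(10)],
    "is_duplicate": [1, 0, 0, 1, 0, 0, 1, 0, 1, 0],
})

# stratify keeps the duplicate ratio similar in both splits;
# random_state (an arbitrary seed) makes the split reproducible.
toy_train, toy_test = train_test_split(
    toy_df, test_size=0.2, stratify=toy_df["is_duplicate"], random_state=42
)

print(len(toy_train), len(toy_test))
print(set(toy_train.index) & set(toy_test.index))  # empty set: no overlap
```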

### Cleaning

- Tokenization
- Stopwords cleaning
- Removing punctuation
- Normalizing
- Stemming

In [12]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [13]:
import string

from nltk.corpus import stopwords
stopwords_eng = stopwords.words('english')

from nltk.stem import WordNetLemmatizer
from nltk import pos_tag, word_tokenize # to get other parts of speech

import spacy
nlp = spacy.load("en_core_web_sm")

In [15]:
len(train_df)

323430

In [14]:
# sample text for testing
text = train_df['question1'][:10]
text

0                         What is the step by step guide to invest in share market in india?
3                                         Why am I mentally very lonely? How can I solve it?
4               Which one dissolve in water quikly sugar, salt, methane and carbon di oxide?
5     Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?
6                                                                        Should I buy tiago?
7                                                             How can I be a good geologist?
8                                                            When do you use シ instead of し?
9                               Motorola (company): Can I hack my Charter Motorolla DCX3400?
10                                 Method to find separation of slits using fresnel biprism?
11                                               How do I read and find my YouTube comments?
Name: question1, dtype: object

In [20]:
def preprocessing(documents):
    """Lowercase, drop stopwords and punctuation, then lemmatize each document with spaCy."""
    cleaned_documents = []
    for text in documents:
        # lowercase
        text = text.lower()

        # stopword removal
        text = " ".join([word for word in text.split() if word not in stopwords_eng])

        # punctuation removal
        text = "".join([char for char in text if char not in string.punctuation])

        # tokenize with spaCy
        doc = nlp(text)

        # lemmatize each token
        cleaned_documents.append([token.lemma_ for token in doc])

    return cleaned_documents


preprocessing(text)

[['step', 'step', 'guide', 'invest', 'share', 'market', 'india'],
 ['mentally', 'lonely', 'solve', 'it'],
 ['one',
  'dissolve',
  'water',
  'quikly',
  'sugar',
  'salt',
  'methane',
  'carbon',
  'di',
  'oxide'],
 ['astrology',
  'capricorn',
  'sun',
  'cap',
  'moon',
  'cap',
  'risingwhat',
  'say',
  'I'],
 ['buy', 'tiago'],
 ['good', 'geologist'],
 ['use', 'シ', 'instead', 'し'],
 ['motorola', 'company', 'hack', 'charter', 'motorolla', 'dcx3400'],
 ['method', 'find', 'separation', 'slit', 'use', 'fresnel', 'biprism'],
 ['read', 'find', 'youtube', 'comment']]
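
In the output above, `it` survives stopword removal and `rising...what` is fused into `risingwhat`. Both follow from the order of operations in `preprocessing`: stopwords are filtered while punctuation is still attached (`it?` is not in the stopword list), and punctuation is then deleted rather than replaced by a space. The sketch below shows one alternative order on plain strings. The tiny `TOY_STOPWORDS` set and the `clean_tokens` helper are hypothetical stand-ins, not the NLTK list or the notebook's function.

```python
import string

# Hypothetical, tiny stopword set for illustration (the notebook uses NLTK's list).
TOY_STOPWORDS = {"i", "it", "am", "a", "and", "does", "that", "about", "me",
                 "how", "can", "why", "very", "what"}

# Map every punctuation character to a space so words are not glued together.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def clean_tokens(text):
    """Lowercase, replace punctuation with spaces, then drop stopwords."""
    text = text.lower().translate(PUNCT_TO_SPACE)
    return [word for word in text.split() if word not in TOY_STOPWORDS]

print(clean_tokens("Why am I mentally very lonely? How can I solve it?"))
# ['mentally', 'lonely', 'solve']
print(clean_tokens("cap rising...what does that say about me?"))
# ['cap', 'rising', 'say']
```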

### Feature Engineering

- tf-idf
- word2vec
- word count
- number of the same words in both questions
- ....
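
The word-count and shared-word features listed above can be computed directly from the token lists that `preprocessing` returns. The following is a minimal sketch; the `pair_features` helper, the feature names, and the Jaccard ratio are assumptions added for illustration, and the `q2` token list is a hand-written example rather than notebook output.

```python
def pair_features(q1_tokens, q2_tokens):
    """Return simple overlap features for one pair of tokenized questions."""
    set1, set2 = set(q1_tokens), set(q2_tokens)
    common = set1 & set2
    union = set1 | set2
    return {
        "q1_word_count": len(q1_tokens),
        "q2_word_count": len(q2_tokens),
        "word_count_diff": abs(len(q1_tokens) - len(q2_tokens)),
        "common_words": len(common),
        # Jaccard similarity; defined as 0.0 when both questions are empty
        "jaccard": len(common) / len(union) if union else 0.0,
    }

q1 = ['step', 'step', 'guide', 'invest', 'share', 'market', 'india']
q2 = ['step', 'step', 'guide', 'invest', 'share', 'market']  # hand-written example
features = pair_features(q1, q2)
print(features)
```

Here the pair shares 5 distinct words out of 6, so `common_words` is 5 and `jaccard` is 5/6.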

In [22]:
# tf-idf 
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
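
The vectorizer above is only instantiated. One way it could be used, sketched here under assumptions, is to fit a single vocabulary on both question columns and score each pair by the cosine similarity of its two TF-IDF vectors. The toy question lists are made up; `paired_cosine_distances` is scikit-learn's row-by-row cosine distance.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import paired_cosine_distances

# Hypothetical toy pairs; in the notebook these would be the question columns.
q1_texts = ["how do i learn python", "what is the capital of france"]
q2_texts = ["how can i learn python quickly", "which fish survive in salt water"]

# Fit one vocabulary on both columns so the two matrices share the same features.
vectorizer = TfidfVectorizer()
vectorizer.fit(q1_texts + q2_texts)
q1_tfidf = vectorizer.transform(q1_texts)
q2_tfidf = vectorizer.transform(q2_texts)

# Cosine similarity of each question1/question2 pair, one score per row.
tfidf_similarity = 1 - paired_cosine_distances(q1_tfidf, q2_tfidf)
print(tfidf_similarity)
```

The first pair shares several words and scores above zero; the second pair shares none, so its similarity is approximately zero.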


### Modeling

Different modeling techniques can be used:

- logistic regression
- XGBoost
- LSTMs
- etc
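
As a rough illustration of the first option in the list, a logistic regression baseline can be trained on pair-level features. The feature matrix and labels below are made-up toy values (columns assumed to be a shared-word count and a Jaccard ratio), so the scores say nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical feature matrix: one row per question pair,
# columns = [common_words, jaccard]; labels are made up for illustration.
X_toy = np.array([[5, 0.83], [4, 0.80], [3, 0.75], [0, 0.00],
                  [1, 0.10], [0, 0.05], [4, 0.70], [1, 0.12]])
y_toy = np.array([1, 1, 1, 0, 0, 0, 1, 0])

clf = LogisticRegression()
clf.fit(X_toy, y_toy)
pred = clf.predict(X_toy)

# Scores on the training rows only; a real evaluation would use the hold-out set.
print("accuracy:", accuracy_score(y_toy, pred))
print("f1:", f1_score(y_toy, pred))
```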

# Using Gensim `Doc2Vec`

In [None]:
## Building the Corpus
preprocessed_train_q1 = preprocessing(train_df['question1'])
preprocessed_train_q2 = preprocessing(train_df['question2'])


In [None]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# one corpus holding every question1 followed by every question2
training_texts = preprocessed_train_q1 + preprocessed_train_q2
# tag each document with its position in the corpus
train_corpus = [TaggedDocument(doc, [i]) for i, doc in enumerate(training_texts)]





### Training the model

In [None]:
import matplotlib.pyplot as plt

# distribution of token counts per preprocessed question
length_of_text = [len(text) for text in (preprocessed_train_q1 + preprocessed_train_q2)]

plt.hist(length_of_text)


In [None]:
# instantiate the model
model = Doc2Vec(vector_size=50, min_count=2, epochs=40)
# build a vocabulary
model.build_vocab(train_corpus)
# train the model on the corpus
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)



### Assessing the model

In [None]:
# assess the model on the training data i.e. do a self-similarity check

In [None]:
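from sklearn.metrics.pairwise import paired_cosine_distances

# the questions have to be tokenized first, so reuse the token lists built above;
# infer one vector per question, then compare each question1/question2 pair row by row
q1_vectors = np.array([model.infer_vector(tokens) for tokens in preprocessed_train_q1])
q2_vectors = np.array([model.infer_vector(tokens) for tokens in preprocessed_train_q2])
doc2vec_similarity = 1 - paired_cosine_distances(q1_vectors, q2_vectors)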
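
One way per-pair similarity scores such as `doc2vec_similarity` could be turned into duplicate predictions is a simple cut-off. This is an assumed next step, not something the notebook does; the scores, labels, candidate thresholds, and the `accuracy_at_threshold` helper below are all hypothetical.

```python
import numpy as np

# Hypothetical similarity scores and labels for five question pairs.
toy_similarity = np.array([0.91, 0.35, 0.78, 0.10, 0.66])
toy_labels = np.array([1, 0, 1, 0, 0])

def accuracy_at_threshold(similarity, labels, threshold):
    """Predict 'duplicate' when similarity >= threshold and return the accuracy."""
    predictions = (similarity >= threshold).astype(int)
    return float((predictions == labels).mean())

# Scan a few candidate cut-offs and keep the one with the best accuracy.
thresholds = [0.3, 0.5, 0.7, 0.9]
scores = {t: accuracy_at_threshold(toy_similarity, toy_labels, t) for t in thresholds}
best_threshold = max(scores, key=scores.get)
print(scores, best_threshold)
```

On these toy values the 0.7 cut-off classifies all five pairs correctly. On real data the threshold would be chosen on a validation split, not on the hold-out set.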

**Observation**: [percentage not recorded] of the inferred documents are most similar to themselves; about 5% of the time, an inferred document is mistakenly most similar to another document.