## Problem Statement
### Quora is a question-and-answer platform where a single question can attract many answers. Because so many questions are posted, a large number of them duplicate one another. Quora wants to treat this as a classification problem: identify pairs of questions that ask the same thing so that their answers can be merged.

### Import library

In [30]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

### Load the dataset

In [31]:
# Raw string so that the backslashes in the Windows path are not treated as escape sequences
df=pd.read_csv(r"D:\DATA SCIENCE Internship with Innomatics\Final_ Project_Quora_Question_Pair_Similarity\data\train.csv")
df.head()

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0


### Dataset information

In [32]:
df.shape

(404290, 6)

In [33]:
df.columns

Index(['id', 'qid1', 'qid2', 'question1', 'question2', 'is_duplicate'], dtype='object')

In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 404290 entries, 0 to 404289
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   id            404290 non-null  int64 
 1   qid1          404290 non-null  int64 
 2   qid2          404290 non-null  int64 
 3   question1     404289 non-null  object
 4   question2     404288 non-null  object
 5   is_duplicate  404290 non-null  int64 
dtypes: int64(4), object(2)
memory usage: 18.5+ MB


#### Since the dataset has more than 400,000 rows, we work with a random sample of 10,000 rows

In [59]:
data=df.sample(10000, random_state=50, ignore_index=True)
data.head()

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,238973,10001,350466,How do you pronounce the name of the Danish si...,How do you pronounce the Danish last name Damk...,0
1,171471,264932,264933,Why do Christians believe in Jesus and that he...,What is the minimum age for girls to get marri...,0
2,22640,42469,42470,What are the best ways to fake your own death?,What are the worst ways to fake one's own death?,0
3,69202,119440,32667,I feel fear all the time. How can I get rid of...,How can I get rid of fear?,1
4,399748,533051,533052,What's it like to be the assistant of a female...,How is the career growth of an as assistant vi...,0


In [36]:
data.shape

(10000, 6)

#### Drop unwanted columns

In [60]:
data.drop(['id','qid1','qid2'], axis=1, inplace=True)
data.columns

Index(['question1', 'question2', 'is_duplicate'], dtype='object')

### Missing values

In [38]:
data.isna().sum()

question1       0
question2       0
is_duplicate    0
dtype: int64

### Duplicate rows

In [39]:
data.duplicated().sum()

0

### Class Imbalance check

In [40]:
data['is_duplicate'].value_counts()

0    6249
1    3751
Name: is_duplicate, dtype: int64
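
The counts above show a moderate imbalance. A minimal sketch, using only the two counts printed above, of the class proportions and of the accuracy that always predicting the majority class would reach on this sample. The variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 6249, 1: 3751}
total = sum(counts.values())

# Share of each class in the 10,000-row sample.
proportions = {label: n / total for label, n in counts.items()}

# Accuracy of always predicting the most frequent class on this sample.
majority_baseline = max(proportions.values())

print(proportions)
print(majority_baseline)
```

Any model trained on this sample should be compared against that baseline rather than against 50%.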

### Text preprocessing

In [41]:
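import re
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
lemma=WordNetLemmatizer()
# Use a separate name so the imported stopwords module is not overwritten;
# a set makes the membership test fast
stop_words=set(stopwords.words('english'))

In [42]:
def text_preprocessing(text):
    # Strip HTML tags first, while the angle brackets are still present
    text=re.sub("<.*?>", " ", str(text))
    # Replace everything except letters with a space, then lowercase
    text=re.sub("[^a-zA-Z]", " ", text).lower()
    # split() without arguments also discards empty tokens
    text=[word for word in text.split() if word not in stop_words]
    text=[lemma.lemmatize(word) for word in text]
    return " ".join(text)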

data['clean_que1']=data['question1'].apply(text_preprocessing)
data['clean_que2']=data['question2'].apply(text_preprocessing)
data.head()

Unnamed: 0,question1,question2,is_duplicate,clean_que1,clean_que2
0,How do you pronounce the name of the Danish si...,How do you pronounce the Danish last name Damk...,0,pronounce name danish singer english,pronounce danish last name damkj r
1,Why do Christians believe in Jesus and that he...,What is the minimum age for girls to get marri...,0,christian believe jesus magical,minimum age girl get married islamic republic ...
2,What are the best ways to fake your own death?,What are the worst ways to fake one's own death?,0,best way fake death,worst way fake one death
3,I feel fear all the time. How can I get rid of...,How can I get rid of fear?,1,feel fear time get rid,get rid fear
4,What's it like to be the assistant of a female...,How is the career growth of an as assistant vi...,0,like assistant female pornstar,career growth assistant vigilance officer e b
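
The order of the two regular-expression substitutions in the preprocessing step matters: if non-letters are removed first, the `<` and `>` characters are already gone when the HTML pattern runs, so tag names stay in the text. A small sketch of both orders, using only the standard library; the function names and the sample string are hypothetical and chosen for illustration.

```python
import re

def strip_html_then_letters(text):
    # Remove tags first, while '<' and '>' are still present.
    text = re.sub(r"<.*?>", " ", text)
    # Then replace every non-letter with a space and lowercase.
    text = re.sub(r"[^a-zA-Z]", " ", text).lower()
    # split() with no argument drops empty tokens from repeated spaces.
    return " ".join(text.split())

def letters_then_strip_html(text):
    # Same steps in the opposite order: the tag delimiters are already
    # gone when the HTML pattern runs, so tag names leak into the text.
    text = re.sub(r"[^a-zA-Z]", " ", text)
    text = re.sub(r"<.*?>", " ", text).lower()
    return " ".join(text.split())

sample = "What is <b>AI</b>?"
print(strip_html_then_letters(sample))
print(letters_then_strip_html(sample))
```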


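### Merge question1 and question2

#### Note: this cell concatenates the raw question1 and question2 columns, not clean_que1 and clean_que2, so the preprocessing above is not applied to the text that the model sees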

In [61]:
data['clean_question']=data['question1']+" "+data['question2']
data.head()

Unnamed: 0,question1,question2,is_duplicate,clean_question
0,How do you pronounce the name of the Danish si...,How do you pronounce the Danish last name Damk...,0,How do you pronounce the name of the Danish si...
1,Why do Christians believe in Jesus and that he...,What is the minimum age for girls to get marri...,0,Why do Christians believe in Jesus and that he...
2,What are the best ways to fake your own death?,What are the worst ways to fake one's own death?,0,What are the best ways to fake your own death?...
3,I feel fear all the time. How can I get rid of...,How can I get rid of fear?,1,I feel fear all the time. How can I get rid of...
4,What's it like to be the assistant of a female...,How is the career growth of an as assistant vi...,0,What's it like to be the assistant of a female...


### Input and output

In [62]:
X=data['clean_question']
y=data['is_duplicate']

### Train test split

In [63]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(X,y, test_size=0.2, random_state=0)

### Text to numeric conversion using tokenization (each word is mapped to an integer index)

In [64]:
from keras.preprocessing.text import Tokenizer

tokenize=Tokenizer(num_words=100000)
tokenize.fit_on_texts(x_train)

In [65]:
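#text to sequence
x_train_seq=tokenize.texts_to_sequences(x_train)
x_test_seq=tokenize.texts_to_sequences(x_test)
# x_train[0] selects the row labelled 0, while x_train_seq[0] is the first row of the
# shuffled training set, so the two printed items come from different rows;
# x_train.iloc[0] is the text that corresponds to x_train_seq[0]
print(x_train[0])
print(x_train_seq[0])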

How do you pronounce the name of the Danish singer "MØ" in English? How do you pronounce the Danish last name Damkjær?
[2, 3, 18, 47, 1, 636, 7886, 218, 28, 39, 118, 60, 87, 2, 10, 1, 636, 69, 47, 355]
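
To make the mapping from words to integers concrete, here is a simplified sketch of the idea behind the tokenizer: words are ranked by frequency in the training texts, the most frequent word gets index 1, and index 0 stays free for padding. This is an approximation for illustration only. It splits on whitespace and does not reproduce the punctuation filtering of the Keras `Tokenizer`, and `fit_word_index` and `encode_texts` are hypothetical helper names.

```python
from collections import Counter

def fit_word_index(texts):
    # Count lowercase whitespace-separated tokens across all texts.
    counts = Counter(word for text in texts for word in text.lower().split())
    # Most frequent word gets index 1; index 0 is left free for padding.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def encode_texts(texts, word_index):
    # Words never seen during fitting are dropped.
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

train_texts = ["how do you learn python", "how do you learn java fast"]
word_index = fit_word_index(train_texts)
sequences = encode_texts(["how do you learn rust fast"], word_index)
print(word_index)
print(sequences)
```

The word "rust" does not occur in the training texts, so it has no index and is left out of the encoded sequence.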


### Use sequence padding so that every sequence has the same length. In other words, every question pair is represented by the same number of tokens

In [66]:
df_copy=data.copy()
df_copy['num_words']=data['clean_question'].apply(lambda x : len(x.split(" ")))
df_copy['num_words'].max()

104

In [67]:
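from keras.utils import pad_sequences
# The longest merged question has 104 whitespace-separated words, so any sequence longer
# than maxlen=100 is truncated; pad_sequences drops tokens from the start by default
max_sentence_len=100
x_train_seq_pad=pad_sequences(x_train_seq, padding='post', maxlen=max_sentence_len)
x_test_seq_pad=pad_sequences(x_test_seq, padding='post', maxlen=max_sentence_len)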

In [68]:
print(x_train[0])
print(x_train_seq[0])
print(x_train_seq_pad[0])

How do you pronounce the name of the Danish singer "MØ" in English? How do you pronounce the Danish last name Damkjær?
[2, 3, 18, 47, 1, 636, 7886, 218, 28, 39, 118, 60, 87, 2, 10, 1, 636, 69, 47, 355]
[   2    3   18   47    1  636 7886  218   28   39  118   60   87    2
   10    1  636   69   47  355    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0]
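
The padded row above can be reproduced by hand. A minimal NumPy sketch of post-padding with truncation from the front, which is what `padding='post'` combined with the default truncation setting does; `pad_post` is a hypothetical helper written for illustration and not the Keras implementation.

```python
import numpy as np

def pad_post(sequences, maxlen):
    # Start from an all-zero matrix; 0 is the padding value.
    padded = np.zeros((len(sequences), maxlen), dtype=np.int32)
    for row, seq in enumerate(sequences):
        # Keep the last maxlen tokens when a sequence is too long.
        kept = seq[-maxlen:]
        # Write the tokens at the start of the row, leaving zeros at the end.
        padded[row, :len(kept)] = kept
    return padded

result = pad_post([[2, 3, 18], [5, 6, 7, 8, 9, 10]], maxlen=4)
print(result)
```

The short sequence is filled with zeros on the right, while the long one loses its first two tokens.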


### Processing output data

In [69]:
from keras.utils import to_categorical
y_train_class=to_categorical(y_train, num_classes=2)
y_test_class=to_categorical(y_test, num_classes=2)
print(y_train_class[0], y_test_class[0])

[1. 0.] [0. 1.]
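
The one-hot encoding shown above maps label 0 to `[1. 0.]` and label 1 to `[0. 1.]`. A short NumPy sketch of the same transformation and its inverse; the `labels` array is made up for illustration.

```python
import numpy as np

labels = np.array([0, 1, 1, 0])
# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(2, dtype=np.float32)[labels]
print(one_hot)
# argmax along the last axis recovers the integer labels.
recovered = one_hot.argmax(axis=-1)
print(recovered)
```

The same `argmax` step is what turns the model's two output probabilities back into a predicted class.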


### Preparing data for RNN

#### Reshape input data to 3 dimensions (samples, time steps, features)

In [70]:
# Add a trailing feature axis of size 1: (samples, 100) -> (samples, 100, 1)
x_train_seq_pad_rnn=np.array(x_train_seq_pad).reshape(x_train_seq_pad.shape[0], x_train_seq_pad.shape[1], 1)
x_test_seq_pad_rnn=np.array(x_test_seq_pad).reshape(x_test_seq_pad.shape[0], x_test_seq_pad.shape[1], 1)
print(x_train_seq_pad_rnn.shape)
print(x_test_seq_pad_rnn.shape)

(8000, 100, 1)
(2000, 100, 1)


### Build RNN model

In [71]:
import keras
from keras.models import Sequential
from keras.layers import Dense, SimpleRNN, Activation
from tensorflow.keras.optimizers import Adam

In [72]:
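model=Sequential()
# 32 recurrent units reading one token id per time step over 100 time steps
model.add(SimpleRNN(32, input_shape=(100,1)))
# Two output units, one per class, turned into probabilities by softmax
model.add(Dense(2))
model.add(Activation("softmax"))
# categorical_crossentropy is the conventional loss for one-hot targets with a softmax output
model.compile(optimizer=Adam(learning_rate=0.01), loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()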

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 simple_rnn_2 (SimpleRNN)    (None, 32)                1088      
                                                                 
 dense_2 (Dense)             (None, 2)                 66        
                                                                 
 activation_2 (Activation)   (None, 2)                 0         
                                                                 
Total params: 1,154
Trainable params: 1,154
Non-trainable params: 0
_________________________________________________________________
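
The parameter counts in the summary can be checked by hand. A simple recurrent layer has an input kernel, a recurrent kernel, and a bias; a dense layer has a weight matrix and a bias. The helper names below are hypothetical and only restate that arithmetic.

```python
def simple_rnn_params(units, input_dim):
    # Input kernel + recurrent kernel + bias.
    return input_dim * units + units * units + units

def dense_params(inputs, outputs):
    # Weight matrix + bias.
    return inputs * outputs + outputs

rnn = simple_rnn_params(units=32, input_dim=1)
dense = dense_params(inputs=32, outputs=2)
print(rnn, dense, rnn + dense)
```

With one feature per time step, almost all of the 1,088 recurrent parameters sit in the 32 x 32 recurrent kernel.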


In [73]:
# Pass the 3-dimensional arrays prepared above, which match input_shape=(100,1)
model.fit(x_train_seq_pad_rnn, y_train_class, epochs=10, batch_size=32, validation_data=(x_test_seq_pad_rnn, y_test_class))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x22cef6f7d90>

In [74]:
# argmax over the two softmax outputs gives the predicted class (0 or 1)
y_pred=np.argmax(model.predict(x_test_seq_pad_rnn), axis=-1)



In [75]:
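from sklearn.metrics import classification_report
# classification_report expects (y_true, y_pred). The report shown below was produced
# with the two arguments swapped, so its precision and recall columns are interchanged
# and its support column counts predicted labels rather than true labels.
print(classification_report(y_test, y_pred))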

              precision    recall  f1-score   support

           0       0.69      0.64      0.67      1363
           1       0.34      0.39      0.36       637

    accuracy                           0.56      2000
   macro avg       0.52      0.52      0.51      2000
weighted avg       0.58      0.56      0.57      2000
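
The argument order of `classification_report` is `(y_true, y_pred)`, and swapping the two exchanges precision and recall for every class. A minimal pure-Python sketch of that effect on toy labels; the labels are invented for illustration and `precision_recall` is a hypothetical helper, not a scikit-learn function.

```python
def precision_recall(y_true, y_pred, label):
    # True positives: rows where both the truth and the prediction equal label.
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)
    actual = sum(t == label for t in y_true)
    # Precision divides by predicted positives, recall by actual positives.
    return tp / predicted, tp / actual

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0]

correct_order = precision_recall(y_true, y_pred, label=1)
swapped_order = precision_recall(y_pred, y_true, label=1)
print(correct_order)
print(swapped_order)
```

Accuracy and the F1 score are symmetric in the two arguments, so only the precision, recall, and support columns change.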



In [76]:
y_pred

array([1, 0, 1, ..., 1, 0, 1], dtype=int64)