In the solution below, we use word embeddings (GloVe) and a deep learning model (LSTM) to predict whether two given questions are duplicates, by estimating the similarity between the two texts. Scroll down to see which methods are used and how they are applied to this NLP problem. The text preprocessing steps described below are common to many NLP problems and are an integral part of most NLP pipelines. If you have any questions, please comment below.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from nltk.tokenize import word_tokenize
import re
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import tensorflow as tf
from tensorflow.keras.layers import *
from tensorflow.keras.models import Model

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Read the train and test dataset

In [None]:
df = pd.read_csv('../input/quora-question-pairs/train.csv.zip')

In [None]:
df.head()

In the above dataset, we are given question pairs and an 'is_duplicate' column, which tells us whether the two questions in a pair are duplicates of each other. We'll train our model on this train dataset.

In [None]:
test_data = pd.read_csv('../input/quora-question-pairs/test.csv.zip')

In [None]:
test_data.head()

This is our test data. We'll calculate the similarity between these question pairs to decide whether they are duplicates or not.

In [None]:
X_train = df.iloc[:,:5].values
Y_train = df.iloc[:,5:].values

In [None]:
X_testq1 = test_data.iloc[:400001,1:2].values
X_testq2 = test_data.iloc[:400001, 2:].values

In [None]:
s1 = X_train[:,3]
s2 = X_train[:,4]

# Text Pre-processing
Text preprocessing is the first step in a Natural Language Processing (NLP) pipeline, and it can affect the final result. It is the process of bringing text into a form that is predictable and analyzable for a specific task. A task is the combination of an approach and a domain. For example, extracting top keywords with TF-IDF (approach) from Tweets (domain) is a task. The main objective of text preprocessing is to turn the text into a form that machine learning algorithms can digest. In this kernel, we will preprocess a corpus of Quora question pairs and then use the filtered dataset to analyse the similarity between the question pairs.

## Tokenization
Tokenization is the process of splitting text into smaller units called tokens, often discarding certain characters, such as punctuation, at the same time. Tokens can be words, numbers, symbols, n-grams, or characters. An n-gram is a sequence of n consecutive words or characters. Word tokenization works by locating word boundaries.

Input: Friends, Romans, Countrymen, lend me your ears

Output: ['Friends','Romans','Countrymen','lend','me','your','ears']             

The most widely used tokenization method is whitespace tokenization, in which the text is split into words at whitespace characters.
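As a minimal sketch of the difference, the snippet below compares plain whitespace splitting with a simple letters-only regular expression on the example sentence above. It uses only the Python standard library and is not the tokenizer used later in this kernel.

```python
import re

text = "Friends, Romans, Countrymen, lend me your ears"

# Whitespace tokenization: punctuation stays attached to the words
whitespace_tokens = text.split()

# Letters-only tokenization: punctuation is thrown away
word_tokens = re.findall(r"[A-Za-z]+", text)

print(whitespace_tokens)
print(word_tokens)
```

Whitespace splitting keeps 'Friends,' with its trailing comma, while the regular expression returns the clean word 'Friends'.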

We have defined a function that tokenizes the train and test questions in the dataset.


In [None]:
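def tokenize(s):
    # Split every question into word tokens with NLTK
    tokens = [word_tokenize(str(sentence)) for sentence in s]

    rm1 = []
    for w in tokens:
        # Replace every non-letter character with a space, then split on whitespace
        sm = re.sub(r'[^A-Za-z]', ' ', ' '.join(w))
        rm1.append(sm.split())

    return rm1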

## Lowercasing
This is the simplest text preprocessing technique: every token of the input text is converted to lowercase. It helps with sparsity in the dataset. For example, a text may contain mixed-case occurrences of the same token, with ‘canada’ in some places and ‘Canada’ in others. Lowercasing removes this variation, which reduces sparsity and shrinks the vocabulary.

Despite these benefits, lowercasing can sometimes hurt performance by increasing ambiguity. Consider ‘Apple is the best company for smartphones’. After lowercasing, ‘Apple’ becomes ‘apple’, and the model can no longer tell from the token alone whether it refers to the company or the fruit.
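A small illustration of the vocabulary effect, using a made-up token list (the tokens are only for demonstration and do not come from the dataset):

```python
# Hypothetical tokens with mixed-case variants of the same words
tokens = ["Canada", "canada", "CANADA", "Apple", "apple"]

vocab_before = set(tokens)                  # every casing counts as a separate entry
vocab_after = {t.lower() for t in tokens}   # casings collapse into one entry per word

print(len(vocab_before), "unique tokens before lowercasing")
print(len(vocab_after), "unique tokens after lowercasing")
```

The same collapse is what makes ‘Apple’ (company) and ‘apple’ (fruit) indistinguishable afterwards.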


In [None]:
def lower_case(s):
    low = []
    for sent in s:
        # Drop empty strings left over from tokenization, then lowercase each token
        low.append([x.lower() for x in sent if x != ''])

    return low
    

## Normalization
Normalization is the process of converting a token into its base form (its stem or lemma). Inflection is removed from the token to obtain the base form of the word. This reduces the number of unique tokens, the redundancy in the data, and its dimensionality, because the variants of a word collapse into one entry.
There are two techniques for normalization: stemming and lemmatization.

### Stemming
Stemming is the elementary rule-based process of removal of inflectional forms from a token. The token is converted into its root form. For example, the word ‘troubled’ is converted into ‘trouble’ after performing stemming. 

There are different algorithms for stemming but the most common algorithm, which is also known to be empirically effective for English, is Porter’s Algorithm. Porter’s Algorithm consists of 5 phases of word reductions applied sequentially.

Since stemming follows a crude heuristic approach that chops off the ends of tokens in the hope of reaching the root form, it sometimes generates non-meaningful terms. For example, it may convert the token ‘increase’ into ‘increas’, which is not a dictionary word.
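To make the "chop off the ending" idea concrete, here is a deliberately simplified suffix stripper. It is a toy written for this illustration, not Porter's algorithm, and the suffix list and minimum stem length are arbitrary assumptions.

```python
# Toy suffix list, checked in order (an assumption made for this sketch)
SUFFIXES = ("ing", "ed", "es", "s", "e")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for w in ["increase", "increases", "increased", "increasing"]:
    print(w, "->", toy_stem(w))
```

All four variants collapse to the same stem, which is useful for matching, but the stem ‘increas’ itself is not a real word.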

### Lemmatization
Lemmatization is similar to stemming, but it relies on a vocabulary and on morphological analysis of words rather than on suffix rules. It aims to remove inflections and return the base or dictionary form of a word, known as the lemma. It may use a dictionary such as WordNet for the mapping, or other rule-based approaches.
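The dictionary-lookup idea can be sketched in a few lines. The lookup table below is a tiny hypothetical stand-in for a real resource such as WordNet, and words missing from it are returned unchanged.

```python
# Hypothetical mini-dictionary mapping inflected forms to lemmas
TOY_LEMMAS = {
    "geese": "goose",
    "mice": "mouse",
    "questions": "question",
    "children": "child",
}

def toy_lemmatize(token):
    """Return the dictionary form if known, otherwise the token itself."""
    return TOY_LEMMAS.get(token, token)

print([toy_lemmatize(t) for t in ["geese", "questions", "increase"]])
```

Unlike suffix stripping, the lookup handles irregular forms such as ‘geese’ and always returns a real word.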

We have used lemmatization to perform normalization. Stemming is a reasonable alternative: for English, the two often give similar downstream results, although this depends on the task.

In [None]:
def lemmatize(s):
    # Lemmatize each token (WordNetLemmatizer treats every token as a noun by default)
    lemma = []
    wnl = WordNetLemmatizer()
    for doc in s:
        tokens = [wnl.lemmatize(w) for w in doc]
        lemma.append(tokens)

    # Removing stopwords
    filter_words = []
    stop_words = set(stopwords.words('english'))
    for sent in lemma:
        tokens = [w for w in sent if w not in stop_words]
        filter_words.append(tokens)

    # Join the remaining tokens back into one string per question
    sentences = []
    for sentence in filter_words:
        sentences.append(' '.join(sentence))

    return sentences

## Stopwords
In the above function, you can see that stopwords are removed from the questions. What are stopwords and why are they removed?
Stopwords are very common words in a language, such as ‘a’, ‘an’, ‘the’, ‘is’ and ‘what’. They are removed from the text so that the analysis concentrates on the more informative words. If we search for ‘what is text preprocessing’, we want to focus on ‘text preprocessing’ rather than on ‘what is’.
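The search example above can be reproduced with a short filter. The stopword set here is a tiny hand-picked subset chosen for the illustration, not NLTK's full English list.

```python
# Hand-picked subset of English stopwords (an assumption for this sketch)
TOY_STOPWORDS = {"a", "an", "the", "is", "what"}

def remove_stopwords(text):
    """Lowercase the text, split on whitespace and drop stopwords."""
    return [w for w in text.lower().split() if w not in TOY_STOPWORDS]

print(remove_stopwords("What is text preprocessing"))
```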

In [None]:
# Manual preprocessing pipeline (kept for reference; the Keras Tokenizer below is used instead)
# sent1 = tokenize(s1)
# sent2 = tokenize(s2)
# q1 = lower_case(sent1)
# q2 = lower_case(sent2)
# listq1 = lemmatize(q1)
# listq2 = lemmatize(q2)
# sent1_t = tokenize(X_testq1.ravel())
# sent2_t = tokenize(X_testq2.ravel())
# q1_t = lower_case(sent1_t)
# q2_t = lower_case(sent2_t)
# listq1_t = lemmatize(q1_t)
# listq2_t = lemmatize(q2_t)

# Keras text preprocessing

When approaching an NLP problem, you can either perform all the steps above in that order or use Keras' Tokenizer class, which by default lowercases the text, filters punctuation and splits on whitespace. In this project, I have tried Keras' Tokenizer class and it works well.

MAX_NB_WORDS is a constant that caps the vocabulary size: when converting text to sequences, the Tokenizer keeps only the most frequent words, up to this limit.
Next, we fit our Tokenizer on all the questions in the 'question1' and 'question2' columns of the training set.
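The effect of a vocabulary cap can be sketched with `collections.Counter`. This is a simplified stand-in, not Keras' implementation: the function name and sample texts are made up, and it assumes the Keras convention that index 0 is reserved and only words with an index below `num_words` are kept.

```python
from collections import Counter

def build_limited_index(texts, num_words):
    """Rank words by frequency (1 = most frequent) and keep indices below num_words."""
    counts = Counter(w for t in texts for w in t.lower().split())
    ranked = [w for w, _ in counts.most_common()]
    full_index = {w: i + 1 for i, w in enumerate(ranked)}
    return {w: i for w, i in full_index.items() if i < num_words}

sample_texts = ["how do i learn python", "how do i learn java", "how to cook"]
limited_index = build_limited_index(sample_texts, num_words=4)
print(limited_index)
```

With the cap set to 4, only the three most frequent words survive, and rarer words such as 'python' would be skipped when building sequences.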

In [None]:
MAX_NB_WORDS = 200000
tokenizer = Tokenizer(num_words = MAX_NB_WORDS)
tokenizer.fit_on_texts(list(df['question1'].values.astype(str))+list(df['question2'].values.astype(str)))


## Padding and sequencing
Now, we convert all the questions in the 'question1' and 'question2' columns of both the train and test sets into sequences of integer word indices, since the model processes numbers, not raw text.
We fix a maximum length for each question. Questions shorter than this length are padded with zeros, and longer ones are truncated, so that every sequence has the same length.

For example, maxlen = 5

sent1 = ['I', 'love','apples']

After sequencing,

sent1 = [1,2,3]

After padding,

sent1 = [1,2,3,0,0]
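The same example can be worked through in plain Python. The helpers below are hypothetical and only mimic the behaviour used in this kernel (zeros appended at the end, and, as with the `pad_sequences` default, over-long sequences truncated from the front); the real work is done by Keras.

```python
def build_word_index(sentences):
    """Assign indices starting at 1 in order of first appearance (0 is reserved for padding)."""
    word_index = {}
    for sentence in sentences:
        for word in sentence:
            if word not in word_index:
                word_index[word] = len(word_index) + 1
    return word_index

def to_padded_sequence(sentence, word_index, maxlen):
    """Map known words to indices, keep the last maxlen items, pad with zeros at the end."""
    seq = [word_index[w] for w in sentence if w in word_index]
    seq = seq[-maxlen:]
    return seq + [0] * (maxlen - len(seq))

sent1 = ['I', 'love', 'apples']
toy_index = build_word_index([sent1])
padded = to_padded_sequence(sent1, toy_index, maxlen=5)
print(padded)
```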

In [None]:
# X_train_q1 = tokenizer.texts_to_sequences(np.array(listq1))
X_train_q1 = tokenizer.texts_to_sequences(df['question1'].values.astype(str))
X_train_q1 = pad_sequences(X_train_q1, maxlen = 30, padding='post')

# X_train_q2 = tokenizer.texts_to_sequences(np.array(listq2))
X_train_q2 = tokenizer.texts_to_sequences(df['question2'].values.astype(str))
X_train_q2 = pad_sequences(X_train_q2, maxlen = 30, padding='post')


In [None]:
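# Cast to str so that missing questions (NaN) do not break the tokenizer, as done for question2 below
X_test_q1 = tokenizer.texts_to_sequences(X_testq1.astype(str).ravel())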
X_test_q1 = pad_sequences(X_test_q1,maxlen = 30, padding='post')

X_test_q2 = tokenizer.texts_to_sequences(X_testq2.astype(str).ravel())
X_test_q2 = pad_sequences(X_test_q2, maxlen = 30, padding='post')

In [None]:
word_index = tokenizer.word_index

## Loading GloVe word embedding

GloVe (Global Vectors) is a model for distributed word representations. It is an unsupervised learning algorithm that produces a vector for each word by mapping words into a space where the distance between words reflects their semantic similarity. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations exhibit interesting linear substructures of the word vector space.
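To show what "distance reflects similarity" means in practice, the sketch below computes cosine similarity between a few made-up 3-dimensional vectors. The numbers are invented for the illustration and are not real GloVe values, which have 200 dimensions in this kernel.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented toy vectors, NOT taken from GloVe
toy_vectors = {
    "king": [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "apple": [0.1, 0.2, 0.9],
}

sim_royal = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_fruit = cosine_similarity(toy_vectors["king"], toy_vectors["apple"])
print(round(sim_royal, 3), round(sim_fruit, 3))
```

Related words are expected to score closer to 1 than unrelated ones, which is the property the embedding layer brings into the model.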

In [None]:
embedding_index = {}
# The with-block closes the file automatically; the GloVe text files are UTF-8 encoded
with open('../input/glove-global-vectors-for-word-representation/glove.6B.200d.txt', 'r', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        vectors = np.asarray(values[1:], dtype='float32')
        embedding_index[word] = vectors

In [None]:
embedding_matrix = np.random.random((len(word_index)+1, 200))
for word, i in word_index.items():
    embedding_vector = embedding_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

# LSTM coding begins !!

Long Short-Term Memory (LSTM) is a variant of the RNN designed to mitigate the **Vanishing Gradients Problem**. It has more gates and parameters than other gated variants such as the GRU.

LSTM operates using three gates: **Input gate, Forget gate and Output gate**. Let’s walk through one time step to understand the procedure.

The first step is to decide what information to throw away from the cell state. This decision is made by a sigmoid layer called the forget gate layer. It looks at $h_{t-1}$ and $x_t$, and outputs a number between 0 and 1 for each number in the cell state $C_{t-1}$. A 1 represents “completely keep this” while a 0 represents “completely get rid of this.”

The next step is to decide what new information to store in the cell state. This has two parts. First, a sigmoid layer called the input gate layer decides which values to update. Next, a tanh layer creates a vector of new candidate values, $\tilde{C}_t$, that could be added to the state. The old state is then scaled by the forget gate and the gated candidate values are added to it, giving the new cell state $C_t$.

Finally, we need to decide what to output. First, a sigmoid layer decides which parts of the cell state to output. Then, the cell state is passed through tanh (to push the values between −1 and 1) and multiplied by the output of the sigmoid gate, so that only the selected parts are output.
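The three steps can be written out for a single time step. The sketch below is a scalar version for readability (real LSTM layers use weight matrices and vectors), and the parameter names and values are arbitrary assumptions chosen for the illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One scalar LSTM step; p holds input weights w_*, recurrent weights u_* and biases b_*."""
    f_t = sigmoid(p["w_f"] * x_t + p["u_f"] * h_prev + p["b_f"])        # forget gate
    i_t = sigmoid(p["w_i"] * x_t + p["u_i"] * h_prev + p["b_i"])        # input gate
    c_tilde = math.tanh(p["w_c"] * x_t + p["u_c"] * h_prev + p["b_c"])  # candidate values
    c_t = f_t * c_prev + i_t * c_tilde                                  # new cell state
    o_t = sigmoid(p["w_o"] * x_t + p["u_o"] * h_prev + p["b_o"])        # output gate
    h_t = o_t * math.tanh(c_t)                                          # new hidden state
    return h_t, c_t

# Arbitrary toy parameters (assumed values, not learned)
toy_params = {name: 0.5 for name in
              ["w_f", "u_f", "b_f", "w_i", "u_i", "b_i",
               "w_c", "u_c", "b_c", "w_o", "u_o", "b_o"]}

h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.25]:
    h, c = lstm_step(x, h, c, toy_params)
    print(round(h, 4), round(c, 4))
```

Because the cell state is updated additively (old state times forget gate, plus gated candidate), information can be carried across many steps when the forget gate stays close to 1.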


## Vanishing Gradients Problem
Vanilla RNNs suffer from the vanishing gradients problem. During backpropagation through time, the gradient is multiplied by a similar factor at every time step, so it shrinks roughly exponentially with the distance it has to travel. As a result, vanilla RNNs struggle to learn long-range dependencies: when predicting a word from previous inputs, they do well if the relevant input is only a few steps back, but often fail if the gap is large.
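A rough numerical illustration, assuming a scalar vanilla RNN of the form h_t = tanh(w * h_{t-1} + x_t) with a made-up recurrent weight: the gradient of the last hidden state with respect to the initial one is a product of one factor per step, and it collapses as the sequence gets longer.

```python
import math

def gradient_through_time(w, inputs, h0=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h = h0
    grad = 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        grad *= w * (1.0 - h * h)  # chain rule: derivative of tanh times recurrent weight
    return grad

w = 0.9  # assumed recurrent weight, chosen only for the demonstration
grad_short = gradient_through_time(w, [0.5] * 5)
grad_long = gradient_through_time(w, [0.5] * 50)
print(grad_short, grad_long)
```

Each factor is at most 0.9 in magnitude here, so after 50 steps the gradient is far smaller than after 5, which is why early inputs barely influence the weight updates.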


In [None]:
# Model for Q1 (tensorflow and the Keras layers are already imported at the top of the notebook)
model_q1 = tf.keras.Sequential()
model_q1.add(Embedding(input_dim = len(word_index)+1,
                       output_dim = 200,
                      weights = [embedding_matrix],
                      input_length = 30))
model_q1.add(LSTM(128, activation = 'tanh', return_sequences = True))
model_q1.add(Dropout(0.2))
model_q1.add(LSTM(128, return_sequences = True))
model_q1.add(LSTM(128))
model_q1.add(Dense(60, activation = 'tanh'))
model_q1.add(Dense(2, activation = 'sigmoid'))

In [None]:
# Model for Q2
model_q2 = tf.keras.Sequential()
model_q2.add(Embedding(input_dim = len(word_index)+1,
                       output_dim = 200,
                      weights = [embedding_matrix],
                      input_length = 30))
model_q2.add(LSTM(128, activation = 'tanh', return_sequences = True))
model_q2.add(Dropout(0.2))
model_q2.add(LSTM(128, return_sequences = True))
model_q2.add(LSTM(128))
model_q2.add(Dense(60, activation = 'tanh'))
model_q2.add(Dense(2, activation = 'sigmoid'))

In [None]:
# Merge the outputs of the two models, i.e., model_q1 and model_q2, by element-wise multiplication
mergedOut = Multiply()([model_q1.output, model_q2.output])

mergedOut = Flatten()(mergedOut)
mergedOut = Dense(100, activation = 'relu')(mergedOut)
mergedOut = Dropout(0.2)(mergedOut)
mergedOut = Dense(50, activation = 'relu')(mergedOut)
mergedOut = Dropout(0.2)(mergedOut)
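# softmax yields a probability distribution over the two classes, as sparse_categorical_crossentropy expects
mergedOut = Dense(2, activation = 'softmax')(mergedOut)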

In [None]:
new_model = Model([model_q1.input, model_q2.input], mergedOut)
new_model.compile(optimizer = 'adam', loss = 'sparse_categorical_crossentropy',
                 metrics = ['accuracy'])
history = new_model.fit([X_train_q1,X_train_q2],Y_train, batch_size = 2000, epochs = 10)

In [None]:
# Dropout is inactive at inference time, so predict() is deterministic:
# averaging two identical passes would give the same values as a single pass
y_pred = new_model.predict([X_test_q1, X_test_q2], batch_size=2000, verbose=1)