 <h1> Paraphrase Detection with Neural Networks - Natural Language Understanding </h1>


CSI4106 Artificial Intelligence<br/>
Project Type 3 : In-depth understanding of a solution approach to an AI problem <br/>
Prepared by Abha Sharma (8254435) & Rupsi Kaushik (8199148) <br/>
Group 33

<h2> Background </h2>

With the growing use of virtual assistants and chatbots, Natural Language Processing (NLP) has become an increasingly popular topic in recent years. From Google AI's Transformer-based models that consider a word's bidirectional context to IBM's training data generator, we now have cutting-edge approaches to solving NLP tasks. However, even with these latest breakthroughs, NLP still faces many challenges, namely the problem of accurately deciphering what humans mean when they express something, regardless of how they express it. This problem falls under Natural Language Understanding (NLU), a subtopic of NLP that aims to increase the proficiency of intelligent systems in exhibiting real knowledge of natural language. Within this field, the task of paraphrase detection (determining whether a pair of sentences conveys the same meaning) is considered an important one. Improving paraphrase detection can also improve other NLP tasks that are integral to the efficiency of existing intelligent systems, such as question answering, information retrieval, and text summarization. For this reason, in this report, we propose to enhance the capability of neural networks in the context of paraphrase detection by using traditional Information Retrieval (IR) techniques as input features. 


<h2> Objectives</h2>

The main objective of this report is to evaluate the performance of a neural network model given different IR features. Additionally, it examines how the quality and number of features and the number of hidden layers affect the overall performance of the model. These results are compared across two different datasets that have been annotated for paraphrase detection. Below is the proposed architecture for our particular neural network: 
<img src="CSI4106-NN.png">

<h2> Datasets </h2>

We will be working with the Quora Question Pairs and Microsoft Research Paraphrase Corpus datasets for this project. You can find them in this folder as 'questions_train.csv' and 'msr_train.csv', respectively. Each dataset contains pairs of sentences (Sentence_1 and Sentence_2) that have been annotated by humans to indicate whether the two sentences are semantically equivalent (is_Paraphrase = 1) or not (is_Paraphrase = 0).

The Quora Question Pairs dataset was obtained from: https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs . <br/>
The Microsoft Research Paraphrase Corpus dataset was obtained from: https://www.microsoft.com/en-ca/download/details.aspx?id=52398

Due to the computational and speed limitations of our machines, we decided to only look at the first 4500 samples from the Quora Question Pairs dataset. Similarly, although Microsoft provides both a train and a test set, we will only be using the train set. This train set will later be split into train and test sets for our neural network. 

In [None]:
#Make sure to import all these modules
import pandas as pd
pd.options.mode.use_inf_as_na = True
from pyemd import emd
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk.wsd import lesk
from nltk import ngrams
from difflib import SequenceMatcher
from gensim.models import Word2Vec
import gensim.downloader as api
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
word_vectors = api.load("glove-wiki-gigaword-100")
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn import preprocessing
import tensorflow as tf 
from tensorflow import keras
import re

In [None]:
# Taking a look at the Quora Question Pairs dataset
#Skip malformed rows; pandas versions before 1.3 used error_bad_lines=False for this
quora_data = pd.read_csv("questions_train.csv", on_bad_lines='skip')
quora_data.Sentence_1 = quora_data.Sentence_1.astype(str)
quora_data.Sentence_2 = quora_data.Sentence_2.astype(str)
quora_data = quora_data[:4500]
quora_data.is_Paraphrase = quora_data.is_Paraphrase.astype(int)
quora_data.head(10)

In [None]:
#Taking a look at the Microsoft Research Paraphrase dataset 
mrp_data = pd.read_csv("msr_train.csv")
mrp_data.Sentence_1 = mrp_data.Sentence_1.astype(str)
mrp_data.Sentence_2 = mrp_data.Sentence_2.astype(str)
mrp_data.head(10)

In [None]:
#Taking a look at BoonBot Test Data 
boon_data = pd.read_csv('boon_test_nn.csv')
boon_data.head(10)

<h3> Dataset Quality </h3>

Both datasets are slightly imbalanced. The Quora dataset contains more ‘not a paraphrase’ examples, while the Microsoft dataset contains more ‘is a paraphrase’ examples. 

In [None]:
print("Quora Data:\nTotal samples in each class:\n{}".format(quora_data['is_Paraphrase'].value_counts()))
print("Number of columns:{}".format(quora_data.shape[1]))
print("Number of rows: {}".format(quora_data.shape[0]))
print("\n")
print("Mrp Data:\nTotal samples in each class:\n{}".format(mrp_data['is_Paraphrase'].value_counts()))
print("Number of columns: {}".format(mrp_data.shape[1]))
print("Number of rows: {}".format(mrp_data.shape[0]))

<h2> Preprocessing & Transformation </h2>

In [None]:
#Removing common words that provide little to no value to semantic information within a sentence.
def remove_stop_words(sentence):
    stop_words = stopwords.words('english')
    processed_sentence = [re.sub(r"[,.!?&$]+",'', word) for word in sentence if not word in stop_words]
    return processed_sentence     
#Tokenize the sentence for further text processing.
def tokenize(sentence):
    tokenized_sentence = sentence.lower().split()
    return tokenized_sentence
#Lemmatization captures the root form of a word but ensures that it is a valid word in the language.
def lemmatize(sentence):
    lemmatizer = WordNetLemmatizer()
    lemmatized_sentence = [lemmatizer.lemmatize(word) for word in sentence]
    return lemmatized_sentence
#Gets synonym for a given word. This lets us capture more semantic information than string matching does.
def add_synonym(word):
    synonym_list = []
    for syn in wordnet.synsets(word):
        for name in syn.lemma_names():
             synonym_list.append(name.split(".")[0].replace('_',' '))
    return list(set(synonym_list))

#Gets an antonym for a given word. This helps us later when we get sentences like "I'm not happy."
#When we detect a negation, we can substitute the antonym and better capture the semantic value.
def add_antonym(word):
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            if lemma.antonyms():
                return lemma.antonyms()[0].name()
    #No lemma of any synset has an antonym
    return None
#Gets the hypernym (the broader category that a word belongs to) of a given word.(ie, clothing is a hypernym of shirt).  
def add_hypernym(word):
    for syn in wordnet.synsets(word):
        for hypernym in syn.hypernyms():
            return hypernym.name().split('.')[0]
#Handles negation term and includes antonyms of this given term. 
def tokenize_negation(sentence):
    negation_adverbs = ["no", "without","not", "n't", "never", "neither", "nor"]
    tokens_with_negation = []
    tokenized_sentence = tokenize(sentence)
    i = 0
    while i < (len(tokenized_sentence)):
        if (i != len(tokenized_sentence)-1) and (tokenized_sentence[i] in negation_adverbs):
            negation_token = add_antonym(tokenized_sentence[i+1])
            if(negation_token):
                tokens_with_negation.append(negation_token)
                i += 2 
            else:
                tokens_with_negation.append(tokenized_sentence[i])
                i +=1
        else:
            tokens_with_negation.append(tokenized_sentence[i])
            i += 1
    return tokens_with_negation

<h2> Baseline Model </h2>

<h3> Pairwise Similarity </h3>

<h4> Description </h4>

Cosine similarity is a simple IR measure that scores the similarity of two input vectors in a multi-dimensional space by the cosine of the angle between them. At its core, this method is syntactic: it is based purely on shared word occurrences and counts, and does not take word order or other semantic information into account. However, it is more than enough to capture similarity for a baseline model. Let's work through an example by hand with the Quora dataset in order to illustrate this measure. 
<h4> Illustration </h4>
<br /> Sentence_1: "What is the step by step guide to invest in share market in India?" <br/> Sentence_2: "What is the step by step guide to invest in share market?" <br />
After tokenization, stopword removal, and lemmatization, our sentences would look something like this: <br/>
Sentence_1: ['step', 'step', 'guide','invest','share','market','india'] <br />
Sentence_2: ['step', 'step', 'guide', 'invest', 'share', 'market'] <br />
Now we calculate <b>term frequency</b> and <b>inverse document frequency(tf-idf)</b>. Term frequency counts the frequency of word occurrence in each sentence: 

In [None]:
#The number of times a word occurs in each sentence
term_document_matrix = {"Document": ['Sentence_1', 'Sentence_2'],
                   "step":[2,2], 
                   "guide":[1,1], 
                   "invest":[1,1],
                   "share":[1,1],
                   "market":[1,1],
                   "india":[1,0]}
tdm_df = pd.DataFrame(term_document_matrix)
print(tdm_df)

The term frequency can be further normalized to take into account sentence length (ie, the tf for 'step' would now be 2/7 for Sentence_1 and 1/3 for Sentence_2).

In [None]:
#Normalized term frequency by accounting for total number of words in each sentence
extended_tdm = {"Document": ['Sentence_1', 'Sentence_2'],
                   "step":[0.29, 0.33], 
                   "guide":[0.14,0.16], 
                   "invest":[0.14,0.16],
                   "share":[0.14,0.16],
                   "market":[0.14,0.16],
                   "india":[0.14,0]}
extended_tdm_df = pd.DataFrame(extended_tdm)
print(extended_tdm_df)                  

Inverse document frequency lets us put more weight on rare terms and less weight on terms that occur frequently across the sentences. This acknowledges that a word like 'step', which occurs in both documents, is less distinguishing than a word like 'india'. The idf is given by the equation: <b> log(N/df(w))</b>, where the logarithm is taken in base 10, N is the total number of sentences, and df (document frequency) is the number of sentences that contain word w. In our case, N is 2 at each comparison, so the document frequencies are:
df(step): 2, df(guide): 2, df(invest): 2, df(share): 2 , df(market): 2 , df(india): 1 <br /> 
Adding 1 to each idf so that words appearing in every sentence do not receive a weight of 0, the idf values are log(2/2) + 1 = 1 for 'step', 'guide', 'invest', 'share', and 'market', and log(2/1) + 1 = 1.3 for 'india'. <br /> 
The tf-idf is then tf * idf: 0.29 * 1 = 0.29, 0.14, 0.14, 0.14, 0.14, 0.182 (0.14 * 1.3) for Sentence_1, and 0.33, 0.16, 0.16, 0.16, 0.16, 0 for Sentence_2, respectively. <br />
Now we can calculate the cosine similarity through the dot product: <img src="cosine.png"> <br />
Where d2 is the tf-idf vector for Sentence_1 and q is the tf-idf vector for Sentence_2 that we previously calculated. <br/>
 = (0.29 * 0.33 + 0.14 * 0.16 + ...) / (sqrt(0.29 * 0.29 + 0.14 * 0.14 + ...) * sqrt(0.33 * 0.33 + 0.16 * 0.16 + ...)) <br />
 = 0.185 / (0.442 * 0.460) <br />
 = 0.185 / 0.203 <br />
 = approx 0.91, which corresponds to an angle of about 0.42 radians <br/> 
 Therefore, the cosine similarity of Sentence_1 and Sentence_2 is approximately 0.91, making them highly similar to one another.  <br />
This process has been coded below with the help of sklearn in order to define our baseline approach. 
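
As a rough check on the hand calculation, the same steps can be reproduced in plain Python. This is only a sketch: the helper names `tf_idf_vector` and `cosine` are ours, and it works with unrounded fractions, so its result can differ from the hand-rounded figures in the second decimal place. It also follows the base-10 idf with the +1 offset used in the illustration, which is not necessarily the weighting that sklearn applies.

```python
import math
from collections import Counter

# Token lists from the illustration (after stop-word removal and lemmatization)
sentence_1 = ['step', 'step', 'guide', 'invest', 'share', 'market', 'india']
sentence_2 = ['step', 'step', 'guide', 'invest', 'share', 'market']
documents = [sentence_1, sentence_2]
vocabulary = sorted(set(sentence_1) | set(sentence_2))

def tf_idf_vector(tokens):
    """Build a tf-idf vector over the shared vocabulary (hypothetical helper)."""
    counts = Counter(tokens)
    vector = []
    for word in vocabulary:
        tf = counts[word] / len(tokens)                   # length-normalized term frequency
        df = sum(1 for doc in documents if word in doc)   # sentences containing the word
        idf = math.log10(len(documents) / df) + 1         # base-10 idf with the +1 offset
        vector.append(tf * idf)
    return vector

def cosine(u, v):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

similarity = cosine(tf_idf_vector(sentence_1), tf_idf_vector(sentence_2))
print(round(similarity, 3))
```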

In [None]:
def calculate_cosine_similarity(sentence_one, sentence_two):
    #handles the calculation of tfidf and building of term document matrices for us
    tfidf = TfidfVectorizer(preprocessor=' '.join)
    tfidf_matrix = tfidf.fit_transform([sentence_one, sentence_two])
    #calculates the cosine similarity of the pairwise sentences
    similarity = cosine_similarity(tfidf_matrix)[0,1]
    return similarity

In [None]:
def get_cosine_similarity(row):
    sentence_one_tokenize = tokenize_negation(row['Sentence_1'])
    sentence_two_tokenize = tokenize_negation(row['Sentence_2'])
    filtered_sentence_one = remove_stop_words(sentence_one_tokenize)
    filtered_sentence_two = remove_stop_words(sentence_two_tokenize)
    lemmatize_sentence_one = lemmatize(filtered_sentence_one)
    lemmatize_sentence_two = lemmatize(filtered_sentence_two)
    return calculate_cosine_similarity(lemmatize_sentence_one, lemmatize_sentence_two)

In [None]:
#Applying Cosine Similarity to Quora sentence pairs 
quora_data['Cosine_Similarity'] = quora_data.apply(get_cosine_similarity, axis=1)
quora_data.head(10)

In [None]:
#Applying Cosine Similarity Coefficient to MRP Corpus sentence pairs 
mrp_data['Cosine_Similarity'] = mrp_data.apply(get_cosine_similarity, axis=1)
mrp_data.head(10)

In [None]:
#Applying Cosine Similarity to BoonBot test data
boon_data['Cosine_Similarity'] = boon_data.apply(get_cosine_similarity, axis=1)
boon_data.head(30)

<h3>Evaluation Measures </h3>

We will be using accuracy, recall, and precision to evaluate our models. 
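
To make the three measures concrete, here is a small hand-checkable sketch. The `predicted` and `actual` lists are made-up values for illustration only, not results from our datasets.

```python
# Made-up labels for illustration: 1 = paraphrase, 0 = not a paraphrase
predicted = [1, 1, 1, 1, 0, 0]
actual    = [1, 0, 1, 0, 1, 0]
positive = 1

pairs = list(zip(predicted, actual))
tp = sum(1 for p, a in pairs if p == positive and a == positive)  # correctly flagged paraphrases
fp = sum(1 for p, a in pairs if p == positive and a != positive)  # flagged, but not paraphrases
fn = sum(1 for p, a in pairs if p != positive and a == positive)  # missed paraphrases
tn = sum(1 for p, a in pairs if p != positive and a != positive)  # correctly rejected pairs

accuracy = (tp + tn) / len(pairs)   # share of all predictions that are right
precision = tp / (tp + fp)          # share of predicted paraphrases that are real
recall = tp / (tp + fn)             # share of real paraphrases that were found

print(tp, fp, fn, tn)
print(accuracy, precision, round(recall, 3))
```

Here precision and recall differ because the model over-predicts the positive class: it finds two of the three real paraphrases, but half of its positive predictions are wrong.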

In [None]:
#Calculate how many were right predictions out of the total predicted
def calculate_accuracy(model, actual_tags):
    correct = 0
    total = 0
    for prediction, actual in zip(model, actual_tags):
        total += 1
        if prediction == actual:
            correct += 1 
    accuracy = correct / total 
    return '{0:.1%}'.format(accuracy)
#Keeps track of true positives, true negatives, false positives, and false negatives.
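#The parameter order matches the calls in calculate_recall and calculate_precision (predictions first).
def build_confusion_matrix(model, actual_tags, classOfInterest):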
    confusion_matrix = {}
    truePositives = len([p for p, a in zip(model, actual_tags) if p == a and p == classOfInterest])
    trueNegatives = len([p for p, a in zip(model, actual_tags) if p == a and p != classOfInterest])
    falsePositives = len([p for p, a in zip(model, actual_tags) if p != a and p == classOfInterest])
    falseNegatives = len([p for p, a in zip(model, actual_tags) if p != a and p != classOfInterest])
    confusion_matrix["tp"] = truePositives
    confusion_matrix["tn"] = trueNegatives
    confusion_matrix["fp"] = falsePositives 
    confusion_matrix["fn"] = falseNegatives
    return confusion_matrix
#Calculate how many relevant predictions are retrieved in general
def calculate_recall(model, actual_tags, classOfInterest):
    matrix = build_confusion_matrix(model, actual_tags, classOfInterest)
    recall = matrix["tp"] / ( matrix["tp"] + matrix ["fn"])
    return '{0:.1%}'.format(recall)
#Calculate how many retrieved predictions are relevant
def calculate_precision(model, actual_tags, classOfInterest):
    matrix = build_confusion_matrix(model, actual_tags, classOfInterest)
    precision = matrix["tp"]/ (matrix["tp"] + matrix["fp"])
    return '{0:.1%}'.format(precision)

<h3> Baseline Model Threshold </h3>

We now need to decide what similarity value is adequate to determine whether a pair of sentences is semantically equivalent. Are we going to accept them as a paraphrase if the measure is above 0.5? Above 0.8? Instead of randomly picking a threshold for the baseline method, we are going to 'learn' the threshold that yields the best overall results for us in terms of accuracy, recall, and precision.

In [None]:
def apply_cosine_classification(row, **kwargs):
    if(row['Cosine_Similarity'] > kwargs['threshold']):
        classification = 1
    else:
        classification = 0
    return classification

thresholds = [0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9]
quora_data_thresholds = {}
mrp_data_thresholds = {}
boon_data_thresholds = {}
accuracies = {}
recalls = {}
precisions = {}
for threshold in thresholds:
    quora_data_thresholds[threshold] = quora_data.apply(apply_cosine_classification, threshold = threshold, axis = 1)
    accuracies[threshold] = calculate_accuracy(quora_data_thresholds[threshold], quora_data['is_Paraphrase'])
    recalls[threshold] = calculate_recall(quora_data_thresholds[threshold], quora_data['is_Paraphrase'], 1)
    precisions[threshold] = calculate_precision(quora_data_thresholds[threshold], quora_data['is_Paraphrase'], 1)
print("Quora Accuracy: {}, \n Quora Recall: {},\nQuora Precision: {}" .format(accuracies, recalls, precisions))

for threshold in thresholds:
    mrp_data_thresholds[threshold] = mrp_data.apply(apply_cosine_classification, threshold = threshold, axis = 1)
    accuracies[threshold] = calculate_accuracy(mrp_data_thresholds[threshold], mrp_data['is_Paraphrase'])
    recalls[threshold] = calculate_recall(mrp_data_thresholds[threshold], mrp_data['is_Paraphrase'],1)
    precisions[threshold] = calculate_precision(mrp_data_thresholds[threshold], mrp_data['is_Paraphrase'], 1)
print("MRP Accuracy: {}, \nMRP Recall: {},\nMRP Precision: {}" .format(accuracies, recalls, precisions))

<h4> Picking a Threshold </h4>
We chose 0.45 as our threshold. We took all three metrics and the results on both datasets into account, but placed more emphasis on accuracy and precision. A high accuracy can suggest that a model performs well, but it must be read together with the other metrics. Some thresholds yield higher accuracy, but the confusion matrix at those thresholds shows predictions heavily skewed toward one class, which is not desirable. For example, if a dataset were 99% one class and 1% the other, a model that always predicted the majority class would reach 99% accuracy while never detecting the minority class; accuracy alone implicitly assumes that false positives and false negatives cost the same. Metrics therefore also need to be considered relative to each class. The precision at 0.45 is reasonable for both datasets, indicating that relatively few non-paraphrases are labelled as paraphrases at this threshold. 
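
The 99% versus 1% scenario can be sketched in a few lines. The labels below are synthetic and the "model" is a hypothetical one that always predicts the majority class; the point is only to show why accuracy has to be read alongside a per-class metric such as recall.

```python
# Synthetic, heavily imbalanced labels: 99 negatives and a single positive
actual = [0] * 99 + [1]
# A hypothetical model that always predicts the majority class
predicted = [0] * len(actual)

correct = sum(1 for p, a in zip(predicted, actual) if p == a)
accuracy = correct / len(actual)

# Recall for the minority class (1): how many real positives were found
tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
recall = tp / (tp + fn)

print(accuracy, recall)
```

Accuracy looks excellent here, yet the minority class is never detected, which is why the threshold is not chosen on accuracy alone.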

In [None]:
quora_baseline = {}
quora_baseline['Cosine_Classification'] = quora_data.apply(apply_cosine_classification, threshold=0.45, axis=1)
quora_baseline_df = pd.DataFrame(quora_baseline)
quora_baseline_df

In [None]:
mrp_baseline = {}
mrp_baseline['Cosine_Classification'] = mrp_data.apply(apply_cosine_classification, threshold = 0.45,  axis=1)
mrp_baseline_df = pd.DataFrame(mrp_baseline)
mrp_baseline_df

In [None]:
boon_baseline = {}
boon_baseline['Cosine_Classification'] = boon_data.apply(apply_cosine_classification, threshold = 0.45, axis = 1)
boon_baseline_df = pd.DataFrame(boon_baseline)
boon_baseline_df.head(10)

<h3> Measures and Results </h3>

In [None]:
def baseline_evaluation_summary(baseline, df):
    baseline_accuracy = calculate_accuracy(baseline['Cosine_Classification'], df['is_Paraphrase'])
    baseline_recall = calculate_recall(baseline['Cosine_Classification'], df['is_Paraphrase'], 1)
    baseline_precision = calculate_precision(baseline['Cosine_Classification'], df['is_Paraphrase'], 1)
    
    baseline_evaluation_summary = {"Model": ['Baseline'],
                   "Accuracy":[(baseline_accuracy)], 
                   "Recall":[baseline_recall], 
                   "Precision":[baseline_precision]}
    results_df = pd.DataFrame(baseline_evaluation_summary)
    print (results_df)

In [None]:
print('Quora Data Baseline Evaluation')
baseline_evaluation_summary(quora_baseline, quora_data)
print('Microsoft Data Baseline Evaluation')
baseline_evaluation_summary(mrp_baseline, mrp_data)
print('BoonBot Data Baseline Evaluation')
baseline_evaluation_summary(boon_baseline, boon_data)
#wherever the cosine similarity is bigger, choose that 
#where they're equal - reprompt the user to choose what they mean 

<h2>Feature Engineering</h2>

<h3>Syntactic Similarity</h3>

<h4>Edit Distance</h4>

Edit distance is a measure of similarity between two strings, source and target. It is the minimum number of operations (insertion, deletion, or substitution) required to transform the source string to target string. For example, the edit distance between 'monkey' and 'money' is 1. Deletion of 'k' in 'monkey' will give us 'money'.
Here, we have implemented edit distance at the word level instead of the character level. This was done by converting each sentence to a list of words and then comparing the elements of the source and target lists. 
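
For intuition, the sketch below is a minimal dynamic-programming version of edit distance that works on any sequence, so the same function covers both the character-level 'monkey'/'money' example and word-level comparison. The function name `levenshtein` and the example sentences are ours; the notebook itself relies on `nltk.edit_distance`.

```python
def levenshtein(source, target):
    """Minimum number of insertions, deletions, and substitutions (illustrative sketch)."""
    # previous[j] holds the distance between the processed prefix of source and target[:j]
    previous = list(range(len(target) + 1))
    for i, source_item in enumerate(source, start=1):
        current = [i]
        for j, target_item in enumerate(target, start=1):
            cost = 0 if source_item == target_item else 1
            current.append(min(previous[j] + 1,          # delete source_item
                               current[j - 1] + 1,       # insert target_item
                               previous[j - 1] + cost))  # substitute (or keep if equal)
        previous = current
    return previous[-1]

# Character level: deleting 'k' turns 'monkey' into 'money'
char_distance = levenshtein("monkey", "money")

# Word level: one substitution (do -> can) and one insertion (fast)
word_distance = levenshtein("how do i learn python".split(),
                            "how can i learn python fast".split())

print(char_distance, word_distance)
```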

In [None]:
def edit_distance(row):
    sentence_one_tokenize = tokenize(row['Sentence_1'])
    sentence_two_tokenize = tokenize(row['Sentence_2'])
    return nltk.edit_distance(sentence_one_tokenize, sentence_two_tokenize)

In [None]:
#Applying Edit Distance to Quora sentence pairs 
quora_data['Edit_distance'] = quora_data.apply(edit_distance, axis=1)
quora_data.head(10)

In [None]:
#Applying Edit Distance to MRP Corpus sentence pairs 
mrp_data['Edit_distance'] = mrp_data.apply(edit_distance, axis=1)
mrp_data.head(10)

In [None]:
boon_data['Edit_distance'] = boon_data.apply(edit_distance, axis=1)
boon_data.head(10)

<h4>Jaccard Similarity Coefficient</h4>

Jaccard Similarity Coefficient is another similarity measure. It is calculated by dividing the intersection (common words) of the two sentences over the union of the two sentences(length of sentence one + length of sentence two - intersection).

In [None]:
def jaccard_sim_coefficient(row):
    #Compare the sets of unique words so that repeated words are not counted more than once
    words_one = set(tokenize(row['Sentence_1']))
    words_two = set(tokenize(row['Sentence_2']))
    intersection = len(words_one & words_two)
    union = len(words_one | words_two)
    #Guard against two empty sentences
    if union == 0:
        return 0
    return intersection / union

In [None]:
#Applying Jaccard Similarity Coefficient to Quora sentence pairs 
quora_data['Jaccard_similarity'] = quora_data.apply(jaccard_sim_coefficient, axis=1)
quora_data.head(10)

In [None]:
#Applying Jaccard Similarity Coefficient to MRP Corpus sentence pairs 
mrp_data['Jaccard_similarity'] = mrp_data.apply(jaccard_sim_coefficient, axis=1)
mrp_data.head(10)

In [None]:
boon_data['Jaccard_similarity'] = boon_data.apply(jaccard_sim_coefficient, axis = 1)
boon_data.head(10)

<h4>Sequence Matcher</h4>

SequenceMatcher is a class in the Python module difflib. It finds the longest contiguous matching block of characters between the two strings, then recursively repeats the search on the pieces to the left and right of that block. The ratio is twice the total number of matched characters divided by the combined length of both strings, which gives a similarity score (a float in [0,1]). For example, <br>
[THANK]S[ FOR][ RESPONSE] and [THANK]ING[ FOR] KIND[ RESPONSE] have 18 matching characters across the three matched blocks, including spaces, and a combined length of 45 characters. Therefore, the ratio is 0.8 (2*18/45).
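
The example above can be inspected directly with difflib. This sketch lists the matched blocks and recomputes the ratio from them; the variable names are ours.

```python
from difflib import SequenceMatcher

a = "THANKS FOR RESPONSE"
b = "THANKING FOR KIND RESPONSE"
matcher = SequenceMatcher(None, a, b)

# get_matching_blocks() ends with a zero-length sentinel block, which we skip
blocks = [a[m.a:m.a + m.size] for m in matcher.get_matching_blocks() if m.size > 0]
matched = sum(len(block) for block in blocks)

# ratio() is 2 * (matched characters) / (total characters in both strings)
manual_ratio = 2 * matched / (len(a) + len(b))
ratio = matcher.ratio()

print(blocks, matched, ratio)
```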

In [None]:
def sequence_matcher(row):
    return SequenceMatcher(None, row['Sentence_1'], row['Sentence_2']).ratio()

In [None]:
#Applying Sequence Matcher to Quora sentence pairs 
quora_data['Sequence_matcher'] = quora_data.apply(sequence_matcher, axis=1)
quora_data.head(10)

In [None]:
#Applying Sequence Matcher to MRP Corpus sentence pairs 
mrp_data['Sequence_matcher'] = mrp_data.apply(sequence_matcher, axis=1)
mrp_data.head(10)

In [None]:
boon_data['Sequence_matcher'] = boon_data.apply(sequence_matcher, axis = 1)
boon_data.head(10)

<h4>N-gram measure</h4>

An N-gram is a sequence of N words. Here, we create the N-grams of both sentences, count the N-grams they have in common, and divide by the size of the union of their N-grams (in other words, we compute a Jaccard coefficient over the N-grams). Through our research, we found that N = 3 or 4 is commonly used and works well for capturing the probability of a word given the previous words. We chose N = 3 because we felt that it would capture context even for shorter sentences.
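
As a small worked example of the idea, the sketch below builds word trigrams with plain Python and compares two made-up sentences. The helper `word_ngrams` and the sentences are ours, chosen so the counts are easy to verify by hand.

```python
def word_ngrams(sentence, n=3):
    """Set of word n-grams in a sentence (illustrative helper)."""
    tokens = sentence.lower().split()
    # A sentence with fewer than n words yields an empty set
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

grams_one = word_ngrams("How can I learn Python quickly")
grams_two = word_ngrams("How can I learn Java quickly")

common = grams_one & grams_two   # trigrams present in both sentences
union = grams_one | grams_two    # every distinct trigram across both sentences
score = len(common) / len(union)

print(sorted(common))
print(len(common), len(union), round(score, 3))
```

Changing a single word removes every trigram that contains it, which is why the score drops to one third here even though five of the six words match.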

In [None]:
def ngram_measure(row):
    n = 3
    #ngrams() returns a one-shot generator, so materialize each result as a set before reusing it
    grams_sentence_one = set(ngrams(row['Sentence_1'].split(), n))
    grams_sentence_two = set(ngrams(row['Sentence_2'].split(), n))
    common_count = len(grams_sentence_one & grams_sentence_two)
    union = len(grams_sentence_one | grams_sentence_two)
    #Sentences with fewer than n words produce no n-grams
    if union == 0:
        return 0
    return common_count / union

In [None]:
#Applying N-gram Measure to Quora sentence pairs 
quora_data['N-gram_measure'] = quora_data.apply(ngram_measure, axis=1)
quora_data.head(10)

In [None]:
#Applying N-gram Measure to MRP Corpus sentence pairs 
mrp_data['N-gram_measure'] = mrp_data.apply(ngram_measure, axis=1)
mrp_data.head(10)

In [None]:
boon_data['N-gram_measure'] = boon_data.apply(ngram_measure, axis = 1)
boon_data.head(10)

<h3>Semantic Similarity</h3>

<h4> Word Mover's Distance </h4>


Word Mover's Distance uses normalized bag of words and word embeddings to calculate the distance between sentences. It retrieves vectors from pre-trained word embeddings models for the words of the sentences. The key assumption with this similarity measure is that similar words should have similar vectors. <br/>
For example, 'Obama speaks to the media in Illinois' and 'The president greets the press in Chicago' have the same meaning, however they do not have any words in common. Word Mover's Distance helps with this. 

In [None]:
def word_movers_distance(row):
    sentence_one_tokenize = tokenize(row['Sentence_1'])
    sentence_two_tokenize = tokenize(row['Sentence_2'])
    filtered_sentence_one = remove_stop_words(sentence_one_tokenize)
    filtered_sentence_two = remove_stop_words(sentence_two_tokenize)
    distance = word_vectors.wmdistance(filtered_sentence_one, filtered_sentence_two)
    return distance

In [None]:
#Applying Word Mover's Distance to Quora sentence pairs 
quora_data['WMD_distance'] = quora_data.apply(word_movers_distance, axis=1)
quora_data.head(10)

In [None]:
#Applying Word Mover's Distance to MRP Corpus sentence pairs 
mrp_data['WMD_distance'] = mrp_data.apply(word_movers_distance, axis=1)
mrp_data.head(10)

In [None]:
boon_data['WMD_distance'] = boon_data.apply(word_movers_distance, axis=1)
boon_data.head(10)

<h4>Named Entity Recognition Similarity</h4>

For this feature, we first collect the named entities in each sentence along with their labels. For example, 'Washington' has the label 'GPE' (geo-political entity). We compute a Jaccard coefficient by dividing the number of common named entities (with matching labels) by the size of the union of the named entities of both sentences. 
If no named entity is detected in either sentence, we simply return 0; otherwise we return the Jaccard coefficient.

In [None]:
def ner_measure(row):
    ner_sentence_one=[]
    ner_sentence_two=[]
    count_common_ner = 0
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(row['Sentence_1']))):
        if hasattr(chunk, 'label'):
            ner_sentence_one.append(chunk)
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(row['Sentence_2']))):
        if hasattr(chunk, 'label'):
            ner_sentence_two.append(chunk)
    for item in ner_sentence_one:
        if item in ner_sentence_two:
            count_common_ner += 1
    union = len(ner_sentence_one) + len(ner_sentence_two) - count_common_ner
    if union == 0:
        return 0
    else:
        return count_common_ner / union

In [None]:
#Applying NER Measure to Quora sentence pairs 
quora_data['NER_similarity'] = quora_data.apply(ner_measure, axis=1)
quora_data.head(10)

In [None]:
#Applying NER Measure to MRP Corpus sentence pairs 
mrp_data['NER_similarity'] = mrp_data.apply(ner_measure, axis=1)
mrp_data.head(10)

In [None]:
boon_data['NER_similarity'] = boon_data.apply(ner_measure, axis=1)
boon_data.head(10)

<h4>Word Sense Disambiguation</h4>

Word Sense Disambiguation finds the best sense of a word from all the given senses of the word. The Lesk algorithm uses WordNet and gets the gloss of all the senses of the word in the sentence and then calculates the maximum overlap with the senses, returning whichever gives the maximum overlap. For example, let's take the phrase 'pine cone'. 'Pine' has two senses. Sense 1: kind of evergreen tree with needle-shaped leaves and Sense 2: waste away through sorrow or illness. 'Cone' has three senses. Sense 1: solid body which narrows to a point. Sense 2: something of this shape whether solid or hollow. Sense 3: fruit of a certain evergreen tree. Comparing the senses of the two words, we can see that 'evergreen tree' is common in one sense of each word. Therefore, Sense 1 of Pine and Sense 3 of Cone are the most appropriate when 'pine' and 'cone' are used together.  
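
The 'pine cone' example can be sketched as a toy gloss-overlap search. This is a simplified illustration of the idea behind Lesk, not the `nltk.wsd.lesk` implementation: the glosses are the ones quoted above rather than WordNet definitions, and the sense labels and the `overlap` helper are ours.

```python
from itertools import product

# Glosses copied from the example above; the sense labels are our own
pine = {
    "pine#1": "kind of evergreen tree with needle-shaped leaves",
    "pine#2": "waste away through sorrow or illness",
}
cone = {
    "cone#1": "solid body which narrows to a point",
    "cone#2": "something of this shape whether solid or hollow",
    "cone#3": "fruit of a certain evergreen tree",
}

def overlap(gloss_one, gloss_two):
    """Number of distinct words the two glosses share."""
    return len(set(gloss_one.split()) & set(gloss_two.split()))

# Score every pairing of a 'pine' sense with a 'cone' sense and keep the best one
best_pair = max(product(pine, cone),
                key=lambda pair: overlap(pine[pair[0]], cone[pair[1]]))
best_overlap = overlap(pine[best_pair[0]], cone[best_pair[1]])

print(best_pair, best_overlap)
```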

Using this knowledge, we created our feature. After applying Lesk to each sentence to detect the most appropriate sense of each word, we look for the senses that the two sentences have in common. To normalize our result, we again use Jaccard. 

In [None]:
def wsd(row):
    sentence_one_tokenize = tokenize(row['Sentence_1'])
    sentence_two_tokenize = tokenize(row['Sentence_2'])
    #lesk expects the context as a list of tokens and returns None for words without a WordNet sense
    sentence_one_senses = {lesk(sentence_one_tokenize, word) for word in sentence_one_tokenize}
    sentence_two_senses = {lesk(sentence_two_tokenize, word) for word in sentence_two_tokenize}
    #Drop None so that words without a sense are not counted as a shared sense
    sentence_one_senses.discard(None)
    sentence_two_senses.discard(None)
    common_senses = len(sentence_one_senses & sentence_two_senses)
    union = len(sentence_one_senses | sentence_two_senses)
    #Guard against sentences in which no sense was found
    if union == 0:
        return 0
    return common_senses / union

In [None]:
#Applying WSD Measure to Quora sentence pairs 
quora_data['WSD'] = quora_data.apply(wsd, axis=1)
quora_data.head(10)

In [None]:
#Applying WSD Measure to MRP Corpus sentence pairs 
mrp_data['WSD'] = mrp_data.apply(wsd, axis=1)
mrp_data.head(10)

In [None]:
boon_data['WSD'] = boon_data.apply(wsd, axis=1)
boon_data.head(10)

<h4> Wordnet Extended Cosine Similarity </h4>
Here, we use WordNet's capabilities to extend our sentences and, therefore, extend our dictionary. We extend our semantic reach by including synonyms, hypernyms, and antonyms in our dictionary. We then call our existing calculate_cosine_similarity method to get the similarity of the extended documents.

In [None]:
def extend_sentence_wordnet(row):
    sentence_one_tokenize = tokenize_negation(row['Sentence_1'])
    sentence_two_tokenize = tokenize_negation(row['Sentence_2'])
    filtered_sentence_one = remove_stop_words(sentence_one_tokenize)
    filtered_sentence_two = remove_stop_words(sentence_two_tokenize)
    lemmatize_sentence_one = lemmatize(filtered_sentence_one)
    lemmatize_sentence_two = lemmatize(filtered_sentence_two)
    #Extend each sentence independently, so that no words are dropped when the sentences differ in length
    extended_dictionary_one = []
    extended_dictionary_two = []
    for word in lemmatize_sentence_one:
        extended_dictionary_one += add_synonym(word)
        hypernym = add_hypernym(word)
        if hypernym:
            #add_hypernym returns a single string, so append it instead of extending character by character
            extended_dictionary_one.append(hypernym)
    for word in lemmatize_sentence_two:
        extended_dictionary_two += add_synonym(word)
        hypernym = add_hypernym(word)
        if hypernym:
            extended_dictionary_two.append(hypernym)
    lemmatize_sentence_one += extended_dictionary_one
    lemmatize_sentence_two += extended_dictionary_two
    
    #calculate similarity based on the extended list 
    similarity = calculate_cosine_similarity(lemmatize_sentence_one, lemmatize_sentence_two)
    return similarity

In [None]:
#Applying Synonym_Hypernym_Cosine to Quora sentence pairs 
quora_data['Synonym_Hypernym_Cosine'] = quora_data.apply(extend_sentence_wordnet, axis=1)
quora_data.head(10)

In [None]:
#Applying Synonym_Hypernym_Cosine to MRP Corpus sentence pairs 
mrp_data['Synonym_Hypernym_Cosine'] = mrp_data.apply(extend_sentence_wordnet, axis = 1)
mrp_data.head(10)

In [None]:
boon_data['Synonym_Hypernym_Cosine'] = boon_data.apply(extend_sentence_wordnet, axis=1)
boon_data.head(10)

<h2> Neural Network Model </h2>

<h3> Post Processing </h3>

We noticed that some of the features contained infinite values, especially the WMD measure. To handle this, we replaced the infinite values with the maximum finite value of that particular feature. We then normalized the feature so that all values fall within the range of 0 to 1.

In [None]:
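# Treat infinite WMD values as missing, then fill them with the largest finite value.
# (A comparison with np.nan using != is always True, so it cannot act as a filter.)
wmd_mrp = mrp_data['WMD_distance'].replace([np.inf, -np.inf], np.nan)
max_mrp = wmd_mrp.max()  # max() skips NaN by default
mrp_data['WMD_distance'] = wmd_mrp.fillna(max_mrp)
# Scale the feature to the range [0, 1].
x = mrp_data[['WMD_distance']].values.astype(float)
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
mrp_data['WMD_distance'] = x_scaled.ravel()  # assign by position, not by index label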

In [None]:
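# Treat infinite WMD values as missing, then fill them with the largest finite value.
wmd_quora = quora_data['WMD_distance'].replace([np.inf, -np.inf], np.nan)
max_quora = wmd_quora.max()  # max() skips NaN by default
quora_data['WMD_distance'] = wmd_quora.fillna(max_quora)
# Scale the feature to the range [0, 1].
x = quora_data[['WMD_distance']].values.astype(float)
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
quora_data['WMD_distance'] = x_scaled.ravel()  # assign by position, not by index label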

In [None]:
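# Treat infinite WMD values as missing, then fill them with the largest finite value.
wmd_boon = boon_data['WMD_distance'].replace([np.inf, -np.inf], np.nan)
max_boon = wmd_boon.max()  # max() skips NaN by default
boon_data['WMD_distance'] = wmd_boon.fillna(max_boon)
# Scale the feature to the range [0, 1].
x = boon_data[['WMD_distance']].values.astype(float)
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
boon_data['WMD_distance'] = x_scaled.ravel()  # assign by position, not by index label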
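
Since the same post-processing is applied to all three data sets, it could be wrapped in one helper. The sketch below is one possible version, not the code we ran: the name `cap_and_scale` is hypothetical, and it does the min-max scaling directly in pandas instead of using MinMaxScaler.

```python
import numpy as np
import pandas as pd


def cap_and_scale(df, column):
    """Replace non-finite values in `column` with the largest finite value,
    then min-max scale the column to the range [0, 1] in place."""
    values = df[column].astype(float).replace([np.inf, -np.inf], np.nan)
    finite_max = values.max()  # max() skips NaN by default
    values = values.fillna(finite_max)
    value_range = values.max() - values.min()
    if value_range == 0:
        # A constant column carries no information; map it to 0.
        df[column] = 0.0
    else:
        df[column] = (values - values.min()) / value_range
    return df


# Toy example with one infinite distance.
toy = pd.DataFrame({"WMD_distance": [0.5, 1.5, np.inf, 2.5]})
cap_and_scale(toy, "WMD_distance")
print(toy["WMD_distance"].tolist())  # [0.0, 0.5, 1.0, 1.0]
```

The infinite entry is first capped at 2.5, the largest finite value, so after scaling it shares the value 1.0 with the true maximum.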

<h3> Multi-layer Perceptron </h3>

In [None]:
def multilayer_perceptron(df, model):
    features = list(df.columns.values)
    features.remove('is_Paraphrase')
    features.remove('Sentence_1')
    features.remove('Sentence_2')
    X = df[features]
    y = df['is_Paraphrase']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    model.compile(optimizer='adam',loss='binary_crossentropy', metrics=['accuracy'])
    model.fit(X_train, y_train, epochs=15, batch_size=1)


    test_loss, test_acc  = model.evaluate(X_test, y_test)
    print('Test accuracy:{}'.format(test_acc))
    return model 

Based on our research, a common rule of thumb for choosing the number of neurons in a hidden layer is ‘to use a number between the size of the input and size of the output layers’. Following this rule of thumb, we set the number of neurons in each hidden layer to 7, which lies between 9, the total number of our inputs, and 1, the number of our outputs. We also noticed an improvement in our model when we increased the number of hidden layers from one to two. However, adding hidden layers beyond two made no significant difference.
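
To give a sense of how small this network is, the following sketch counts the trainable parameters (weights plus biases) of a fully connected network from its layer sizes. The helper name `dense_parameter_count` is ours for illustration and is not part of Keras.

```python
def dense_parameter_count(layer_sizes):
    """Number of weights plus biases in a fully connected network
    whose consecutive layer widths are given by `layer_sizes`."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus one bias per output unit
    return total


one_hidden = dense_parameter_count([9, 7, 1])     # 9 inputs, one hidden layer of 7, 1 output
two_hidden = dense_parameter_count([9, 7, 7, 1])  # 9 inputs, two hidden layers of 7, 1 output
print(one_hidden, two_hidden)  # 78 134
```

Going from one hidden layer to two adds only 56 parameters (a 7 by 7 weight matrix and 7 biases), so the two-layer model remains very small.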

In [None]:
# Quora data NN
model_quora = keras.Sequential([
    keras.layers.Flatten(input_shape=(9,)),
    keras.layers.Dense(7, activation=tf.nn.relu),
    keras.layers.Dense(7, activation=tf.nn.relu),
    keras.layers.Dense(1, activation=tf.nn.sigmoid),
    ])
print("Multi-Layer Perceptron Results for quora_data")
model_quora = multilayer_perceptron(quora_data, model_quora)

In [None]:
# Microsoft data NN
model_mrp = keras.Sequential([
    keras.layers.Flatten(input_shape=(9,)),
    keras.layers.Dense(7, activation=tf.nn.relu),
    keras.layers.Dense(7, activation=tf.nn.relu),
    keras.layers.Dense(1, activation=tf.nn.sigmoid),
    ])
print("Multi-Layer Perceptron Results for mrp_data")
model_mrp = multilayer_perceptron(mrp_data, model_mrp)

In [None]:
# Boon data NN
model_boon = keras.Sequential([
    keras.layers.Flatten(input_shape=(9,)),
    keras.layers.Dense(7, activation=tf.nn.relu),
    keras.layers.Dense(7, activation=tf.nn.relu),
    keras.layers.Dense(1, activation=tf.nn.sigmoid),
    ])
print("Multi-Layer Perceptron Results for boon")
model_boon = multilayer_perceptron(boon_data, model_boon)

<h4>Feature Testing</h4>
We removed each feature in turn and retrained the neural network to see how its performance changed. The logic of this approach is that if the metrics are higher without a feature, then that feature is adding little value to the overall model. For Quora, we found that the worst feature was WMD distance. For Microsoft, the worst one was the word n-gram. However, the metrics were all similar to one another, so no feature clearly outperformed or underperformed the rest. In the future, we plan to look at more efficient ways of evaluating features, such as permutation feature importance, where the values of a feature are shuffled rather than the feature being dropped entirely as we did here. This process would also tell us which features were our best in a more principled manner.

In [None]:
def multilayer_perceptron_feature_testing(df, model, test_feature):
    features = list(df.columns.values)
    features.remove('is_Paraphrase')
    features.remove('Sentence_1')
    features.remove('Sentence_2')
    features.remove(test_feature)
    X = df[features]
    y = df['is_Paraphrase']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    model.compile(optimizer='adam',loss='binary_crossentropy', metrics=['accuracy', keras.metrics.Precision(), keras.metrics.Recall()])
    model.fit(X_train, y_train, epochs=15, batch_size=1)

    test_loss, test_acc, test_pre, test_recall = model.evaluate(X_test, y_test)
    print('Test accuracy:{}, test recall: {}, test precision: {}'.format(test_acc, test_recall, test_pre))

In [None]:
# Quora data NN feature testing 
features = list(quora_data.columns.values)
features.remove('is_Paraphrase')
features.remove('Sentence_1')
features.remove('Sentence_2')

for feature in features:
    print("Removing feature {}".format(feature))
    model_quora_testing = keras.Sequential([
        keras.layers.Flatten(input_shape=(8,)),
        keras.layers.Dense(7, activation=tf.nn.relu),
        keras.layers.Dense(7, activation=tf.nn.relu),
        keras.layers.Dense(1, activation=tf.nn.sigmoid),
    ])
    multilayer_perceptron_feature_testing(quora_data, model_quora_testing, feature)
    print("\n")

In [None]:
# Microsoft data feature testing
features = list(mrp_data.columns.values)
features.remove('is_Paraphrase')
features.remove('Sentence_1')
features.remove('Sentence_2')

for feature in features:
    print("Removing feature {}".format(feature))
    model_mrp_testing = keras.Sequential([
        keras.layers.Flatten(input_shape=(8,)),
        keras.layers.Dense(7, activation=tf.nn.relu),
        keras.layers.Dense(7, activation=tf.nn.relu),
        keras.layers.Dense(1, activation=tf.nn.sigmoid),
    ])
    multilayer_perceptron_feature_testing(mrp_data, model_mrp_testing, feature)
    print("\n")
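
As a rough illustration of the permutation feature importance idea mentioned in this section, here is a minimal NumPy sketch. It is not something we ran on our data: `permutation_importance` and `toy_predict` are hypothetical names, the data is synthetic, and accuracy is assumed as the metric.

```python
import numpy as np


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when each column of X is shuffled in turn.

    `predict` maps a feature matrix to an array of 0/1 labels."""
    rng = np.random.default_rng(seed)
    baseline = np.mean(predict(X) == y)
    importances = np.zeros(X.shape[1])
    for column in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_shuffled = X.copy()
            # Break the link between this feature and the labels.
            X_shuffled[:, column] = rng.permutation(X_shuffled[:, column])
            drops.append(baseline - np.mean(predict(X_shuffled) == y))
        importances[column] = np.mean(drops)
    return importances


# Synthetic data: the label depends only on column 0; column 1 is noise.
data_rng = np.random.default_rng(1)
X_toy = data_rng.random((200, 2))
y_toy = (X_toy[:, 0] > 0.5).astype(int)


def toy_predict(X):
    # Hypothetical stand-in for a trained model's thresholded predictions.
    return (X[:, 0] > 0.5).astype(int)


scores = permutation_importance(toy_predict, X_toy, y_toy)
print(scores)
```

Shuffling column 0 should cause a large drop in accuracy, while shuffling the noise column changes nothing. Unlike our drop-one-feature loop, the model is trained only once. With a trained Keras model, `predict` could wrap `model.predict` and threshold the sigmoid output at 0.5.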

<h2> Evaluation and Conclusion </h2>

In conclusion, our baseline model performed better on the Quora data set, while our neural network performed better on the Microsoft data set. One possible reason is that the Quora data set is noisier and closer to day-to-day human language, with more slang and emoji, while the Microsoft data set is more formal.

Our models performed fairly well at capturing similarity, but this is not the same as capturing paraphrases. For example, recall the sentences ‘What is the step by step guide to invest in share market in India?’ and ‘What is the step by step guide to invest in share market?’ that we saw before. Our input features give this pair a high similarity, yet the two would not be considered paraphrases, because the one distinguishing word 'India' alters the overall context. Conversely, some sentences are paraphrases without having any words in common, so our word-overlap measures cannot capture them. We thought that our WMD distance measure would help in these cases. However, it is very likely that the WMD distance measure on its own is not enough to influence the neural network.

We hypothesized that syntactic and semantic similarity are both important when predicting whether a pair of sentences are paraphrases, and we built both syntactic and semantic features based on this hypothesis. In the future, however, we may need to favour semantic features rather than keep a balance of the two. Furthermore, our model uses only 9 input features, when ideally it would use many more, encompassing topic features, linguistic features, and others. Lastly, we could increase the number of samples and improve the way we sample our data, for example by using k-fold cross-validation and by balancing our classes through oversampling.
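
The gap between similarity and paraphrase in the 'India' example can be made concrete with a simple word-overlap score. The sketch below uses a plain Jaccard similarity over lowercase word sets. The helpers `token_set` and `jaccard_similarity` are written for illustration and may differ from the feature implementations used earlier in this notebook.

```python
import re


def token_set(sentence):
    """Lowercase the sentence and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def jaccard_similarity(sentence_a, sentence_b):
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    tokens_a, tokens_b = token_set(sentence_a), token_set(sentence_b)
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


question_one = "What is the step by step guide to invest in share market in India?"
question_two = "What is the step by step guide to invest in share market?"

score = jaccard_similarity(question_one, question_two)
only_in_one = token_set(question_one) - token_set(question_two)
print(round(score, 3), only_in_one)  # 0.917 {'india'}
```

The two questions share 11 of their 12 distinct words, so an overlap-based feature rates them as nearly identical, even though the one word that differs is the one that changes the meaning.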

<h2> References </h2>

https://python.gotrained.com/nltk-edit-distance-jaccard-distance/ <br>
http://www.nltk.org/_modules/nltk/model/ngram.html <br>
https://medium.com/@nikhiljaiswal_9475/sequencematcher-in-python-6b1e6f3915fc <br>
https://towardsdatascience.com/word-distance-between-word-embeddings-cc3e9cf1d632 <br>
https://radimrehurek.com/gensim/models/keyedvectors.html <br>
https://stackoverflow.com/questions/31836058/nltk-named-entity-recognition-to-a-python-list/48738383 <br>
https://www.kaggle.com/antriksh5235/semantic-similarity-using-wordnet <br>
https://pdfs.semanticscholar.org/651e/e5def5cabff3cdf03b6c1a44c00aad9ef527.pdf <br>
https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw <br>
https://blog.exsilio.com/all/accuracy-precision-recall-f1-score-interpretation-of-performance-measures/ <br>
http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/ <br>