 <h1> Paraphrase Detection with Neural Networks - Natural Language Understanding </h1>


CSI4106 Artificial Intelligence<br/>
Project Type 3 <br />
Prepared by Abha Sharma & Rupsi Kaushik

<h2> Background </h2>

With the growing trends of virtual assistants and chatbots, Natural Language Processing (NLP) is a topic that is becoming increasingly popular in the recent years. From Google AI's Transformer-based models that consider a word's double-sided context to IBM's training data generator, today we have cutting edge approaches to solving NLP tasks.  However, even with these latest breakthroughs, NLP still faces many challenges, namely the problem of accurately deciphering what humans mean when they express something, regardless of how they express it. This problem falls under Natural Language Understanding (NLU), a subtopic of NLP that aims to increase the proficiency of intelligent systems in exhibiting real knowledge of natural language. Within this field, the task of paraphrase detection - determining whether a pair of sentences convey identical meaning - is considered to be an important one. Through the improvement of paraphrase detection, other NLP tasks that are integral to the efficiency of existing intelligent systems, such as question answering, information retrieval, and text summarization, can also be improved. For this reason, in this report, we propose to enhance the capability of neural networks in the context of paraphrase detection through the use of traditional Information Retrieval (IR) techniques as input features. 


<h2> Objectives - edit at the end </h2>

The main objective of this report is to evaluate the performance of a neural network model given different IR features. Additionally, it will take a look at how the number of features and hidden layers improve the overall performance of the model. These results will be compared among two different training sets that have been annotated for paraphrase detection. Below is the proposed architecture for our particular neural network: 
<img src="CSI4106-NN.png">

<h2> Datasets </h2>

We will be working with the Quora Question Pairs and Microsoft Research Paraphrase Corpus datasets for this project. You can find them in this folder labelled as 'msr_train.csv' and 'questions_train.csv'. Each dataset contains pairs of sentences (Sentence_1 and Sentence_2), which have been annotated by humans to indicate whether these sentences capture a semantic equivalence (is_Paraphrase = 1) or not (is_Paraphrase = 0).

In [17]:
#Make sure to import all these modules
import pandas as pd
pd.options.mode.use_inf_as_na = True
from pyemd import emd
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk.wsd import lesk
from nltk import ngrams
from difflib import SequenceMatcher
from gensim.models import Word2Vec
import gensim.downloader as api
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
word_vectors = api.load("glove-wiki-gigaword-100")
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn import preprocessing
import tensorflow as tf 
from tensorflow import keras
import re

In [2]:
# Taking a look at the Quora Question Pairs dataset
quora_data = pd.read_csv("questions_train.csv", error_bad_lines=False)
quora_data.Sentence_1 = quora_data.Sentence_1.astype(str)
quora_data.Sentence_2 = quora_data.Sentence_2.astype(str)
quora_data = quora_data[:4500]
quora_data.is_Paraphrase = quora_data.is_Paraphrase.astype(int)
quora_data.head(20)

Unnamed: 0,Sentence_1,Sentence_2,is_Paraphrase
0,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0
5,Astrology: I am a Capricorn Sun Cap moon and c...,"I'm a triple Capricorn (Sun, Moon and ascendan...",1
6,Should I buy tiago?,What keeps childern active and far from phone ...,0
7,How can I be a good geologist?,What should I do to be a great geologist?,1
8,When do you use シ instead of し?,"When do you use ""&"" instead of ""and""?",0
9,Motorola (company): Can I hack my Charter Moto...,How do I hack Motorola DCX3400 for free internet?,0


In [3]:
#Taking a look at the Microsoft Research Paraphrase dataset 
mrp_data = pd.read_csv("msr_train.csv")
mrp_data.Sentence_1 = mrp_data.Sentence_1.astype(str)
mrp_data.Sentence_2 = mrp_data.Sentence_2.astype(str)
mrp_data.head(20)

Unnamed: 0,Sentence_1,Sentence_2,is_Paraphrase
0,"Amrozi accused his brother, whom he called ""th...","Referring to him as only ""the witness"", Amrozi...",1
1,Yucaipa owned Dominick's before selling the ch...,Yucaipa bought Dominick's in 1995 for $693 mil...,0
2,They had published an advertisement on the Int...,"On June 10, the ship's owners had published an...",1
3,"Around 0335 GMT, Tab shares were up 19 cents, ...","Tab shares jumped 20 cents, or 4.6%, to set a ...",0
4,"The stock rose $2.11, or about 11 percent, to ...",PG&E Corp. shares jumped $1.63 or 8 percent to...,1
5,Revenue in the first quarter of the year dropp...,With the scandal hanging over Stewart's compan...,1
6,"The Nasdaq had a weekly gain of 17.27, or 1.2 ...",The tech-laced Nasdaq Composite .IXIC rallied ...,0
7,The DVD-CCA then appealed to the state Supreme...,The DVD CCA appealed that decision to the U.S....,1
8,"That compared with $35.18 million, or 24 cents...",Earnings were affected by a non-recurring $8 m...,0
9,He said the foodservice pie business doesn't f...,The foodservice pie business does not fit our ...,1


<h3> Dataset Quality </h3>

In [4]:
print("Quora Data:\n{}".format(quora_data['is_Paraphrase'].value_counts()))
print("Mrp Data:\n{}".format(mrp_data['is_Paraphrase'].value_counts()))

Quora Data:
0    2789
1    1711
Name: is_Paraphrase, dtype: int64
Mrp Data:
1    2668
0    1281
Name: is_Paraphrase, dtype: int64


<h2> Preprocessing & Transformation </h2>

In [25]:
#Removing common words that provide little to no value to semantic information within a sentence.
def remove_stop_words(sentence):
    stop_words = stopwords.words('english')
    processed_sentence = [re.sub(r"[,.!?&$]+",'', word) for word in sentence if not word in stop_words]
    return processed_sentence     
#Tokenize the sentence for further text processing.
def tokenize(sentence):
    tokenized_sentence = sentence.lower().split()
    return tokenized_sentence
#Lemmatization captures the root form of a word but ensures that it is a valid word in the language.
def lemmatize(sentence):
    lemmatizer = WordNetLemmatizer()
    lemmatized_sentence = [lemmatizer.lemmatize(word) for word in sentence]
    return lemmatized_sentence
#Gets synonym for a given word. This lets us capture more semantic information than string matching does.
def add_synonym(word):
    synonym_list = []
    for syn in wordnet.synsets(word):
        for name in syn.lemma_names():
             synonym_list.append(name.split(".")[0].replace('_',' '))
    return list(set(synonym_list))

'''Gets antonym for a given word. This helps us later when we get sentences like "I'm not happy."
    When we detect a negation, we are now able to better capture the semantic value.
'''
def add_antonym(word):
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            if(lemma.antonyms()):
                return lemma.antonyms()[0].name()
            else:
                return None     
#Gets the hypernym (the broader category that a word belongs to) of a given word.(ie, clothing is a hypernym of shirt).  
def add_hypernym(word):
    for syn in wordnet.synsets(word):
        for hypernym in syn.hypernyms():
            return hypernym.name().split('.')[0]
#Handles negation term and includes antonyms of this given term. 
def tokenize_negation(sentence):
    negation_adverbs = ["no", "without","not", "n't", "never", "neith", "nor"]
    tokens_with_negation = []
    tokenized_sentence = tokenize(sentence)
    i = 0
    while i < (len(tokenized_sentence)):
        if (i != len(tokenized_sentence)-1) and (tokenized_sentence[i] in negation_adverbs):
            negation_token = add_antonym(tokenized_sentence[i+1])
            if(negation_token):
                tokens_with_negation.append(negation_token)
                i += 2 
            else:
                tokens_with_negation.append(tokenized_sentence[i])
                i +=1
        else:
            tokens_with_negation.append(tokenized_sentence[i])
            i += 1
    return tokens_with_negation

<h2> Baseline Model </h2>

<h3> Pairwise Similarity </h3>

<h4> Description </h4>

Cosine Similarity is a simple IR measure that calculates similarity among pairwise input vectors projected in a multi-dimensional space, based on their cosine angle. In its core, this method is syntactic as it is purely based on common word occurrences/counts and does not take into account word order or other semantic information. However, it is more than enough to capture similarity for a baseline model. *explain why we chose this*. Let's do an example by hand with Quora dataset in order to illustrate this measure. 
<h4> Illustration </h4>
<br /> Sentence_1: "What is the step by step guide to invest in share market in India?" <br/> Sentence_2: "What is the step by step guide to invest in share market?" <br />
After tokenization, stopword removal, and lemmatization, our sentences would look something like this: <br/>
Sentence_1: ['step', 'step', 'guide','invest','share','market','india'] <br />
Sentence_2: ['step', 'step', 'guide', 'invest', 'share', 'market'] <br />
Now we calculate <b>term frequency</b> and <b>inverse document frequency(tf-idf)</b>. Term frequency counts the frequency of word occurrence in each sentence: 

In [26]:
#The number of times a word occurs in each sentence
term_document_matrix = {"Document": ['Sentence_1', 'Sentence_2'],
                   "step":[2,2], 
                   "guide":[1,1], 
                   "invest":[1,1],
                   "share":[1,1],
                   "market":[1,1],
                   "india":[1,0]}
tdm_df = pd.DataFrame(term_document_matrix)
print(tdm_df)

     Document  step  guide  invest  share  market  india
0  Sentence_1     2      1       1      1       1      1
1  Sentence_2     2      1       1      1       1      0


The term frequency can be further normalized to take into account sentence length (ie, the tf for 'step' would now be 2/7 for Sentence_1 and 1/3 for Sentence_2).

In [34]:
#Normalized term frequency by accounting for total number of words in each sentence
extended_tdm = {"Document": ['Sentence_1', 'Sentence_2'],
                   "step":[0.29, 0.33], 
                   "guide":[0.14,0.16], 
                   "invest":[0.14,0.16],
                   "share":[0.14,0.16],
                   "market":[0.14,0.16],
                   "india":[0.14,0]}
extended_tdm_df = pd.DataFrame(extended_tdm)
print(extended_tdm_df)                  

     Document  step  guide  invest  share  market  india
0  Sentence_1  0.29   0.14    0.14   0.14    0.14   0.14
1  Sentence_2  0.33   0.16    0.16   0.16    0.16   0.00


Inverse document frequency lets us to put more value on the occurrence of rare terms and put less value on the frequently occurring terms throughout the sentences. This is done to acknowledge that the frequent occurrence of a word like 'step' in both documents is less distinguishing than the occurrence of a word like 'india'. This concept of idf can be captured through the equation: <b> log(N/df(w))</b>, where N is the total number of sentences and df (aka document frequency) is the number of sentences with word w in it. In our case, N is 2 at each comparison. This means that the df at word w in our case is:
df(step): 2, df(guide): 2, df(invest): 2, df(share): 2 , df(market): 2 , df(india): 1 <br /> 
The idf can, thus, be calculated through our equation as: log(2/2) +1 = 1, 1, 1, 1, 1, 1.3, respectively (adding 1 to accommodate for 0's). <br /> 
Then the tf-idf becomes tf * idf = 1*0.29 = 0.29, 0.14, 0.14, 0.14, 0.14, 0.182 for Sentence_1, and 0.33, 0.16, 0.16, 0.16,0.16, 0 for Sentence_2, respectively. <br />
Now we can calculate the cosine similarity through the dot product: <img src="cosine.png"> <br />
Where d2 is the tf-idf vector for Sentence_1 and q is the tf-idf vector for Sentence_2 that we previously calculated. <br/>
 = (0.29 * 0.33 + 0.14 * 0.16 + ...) / (0.29 * 0.29 + 0.14 * 0.14 + ...) (0.33 * 0.33 + 0.16 * 0.16 + ...) <br />
 = 0.182/0.203 <br />
 = approx 0.89 or 0.46 radians <br/> 
 Therefore, the cosine similarity of Sentence_1 and Sentence_2 is approximately 0.89, making them highly similar to one another.  <br />
This process has been coded below with the help of sklearn in order to define our baseline approach. 

In [79]:
def calculate_cosine_similarity(sentence_one, sentence_two):
    #handles the calculation of tfidf and building of term document matrices for us
    tfidf = TfidfVectorizer(preprocessor=' '.join)
    tfidf_matrix = tfidf.fit_transform([sentence_one, sentence_two])
    #calculates the cosine similarity of the pairwise sentences
    similarity = cosine_similarity(tfidf_matrix)[0,1]
    return similarity

In [80]:
def get_cosine_similarity(row):
    sentence_one_tokenize = tokenize_negation(row['Sentence_1'])
    sentence_two_tokenize = tokenize_negation(row['Sentence_2'])
    filtered_sentence_one = remove_stop_words(sentence_one_tokenize)
    filtered_sentence_two = remove_stop_words(sentence_two_tokenize)
    lemmatize_sentence_one = lemmatize(filtered_sentence_one)
    lemmatize_sentence_two = lemmatize(filtered_sentence_two)
    return calculate_cosine_similarity(lemmatize_sentence_one, lemmatize_sentence_two)

In [37]:
#Applying Cosine Similaritye to Quora sentence pairs 
quora_data['Cosine_Similarity'] = quora_data.apply(get_cosine_similarity, axis=1)
quora_data.head(20)

Unnamed: 0,Sentence_1,Sentence_2,is_Paraphrase,Cosine_Similarity
0,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0,0.895532
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0,0.410995
2,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0,0.225765
3,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0,0.0
4,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0,0.168368
5,Astrology: I am a Capricorn Sun Cap moon and c...,"I'm a triple Capricorn (Sun, Moon and ascendan...",1,0.422795
6,Should I buy tiago?,What keeps childern active and far from phone ...,0,0.0
7,How can I be a good geologist?,What should I do to be a great geologist?,1,0.336097
8,When do you use シ instead of し?,"When do you use ""&"" instead of ""and""?",0,0.709297
9,Motorola (company): Can I hack my Charter Moto...,How do I hack Motorola DCX3400 for free internet?,0,0.380873


In [38]:
#Applying Cosine Similarity Coefficient to MRP Corpus sentence pairs 
mrp_data['Cosine_Similarity'] = mrp_data.apply(get_cosine_similarity, axis=1)
mrp_data.head(20)

Unnamed: 0,Sentence_1,Sentence_2,is_Paraphrase,Cosine_Similarity
0,"Amrozi accused his brother, whom he called ""th...","Referring to him as only ""the witness"", Amrozi...",1,0.801978
1,Yucaipa owned Dominick's before selling the ch...,Yucaipa bought Dominick's in 1995 for $693 mil...,0,0.339099
2,They had published an advertisement on the Int...,"On June 10, the ship's owners had published an...",1,0.588364
3,"Around 0335 GMT, Tab shares were up 19 cents, ...","Tab shares jumped 20 cents, or 4.6%, to set a ...",0,0.397346
4,"The stock rose $2.11, or about 11 percent, to ...",PG&E Corp. shares jumped $1.63 or 8 percent to...,1,0.381322
5,Revenue in the first quarter of the year dropp...,With the scandal hanging over Stewart's compan...,1,0.776515
6,"The Nasdaq had a weekly gain of 17.27, or 1.2 ...",The tech-laced Nasdaq Composite .IXIC rallied ...,0,0.179523
7,The DVD-CCA then appealed to the state Supreme...,The DVD CCA appealed that decision to the U.S....,1,0.716812
8,"That compared with $35.18 million, or 24 cents...",Earnings were affected by a non-recurring $8 m...,0,0.252334
9,He said the foodservice pie business doesn't f...,The foodservice pie business does not fit our ...,1,0.81818


<h3>Evaluation Measures </h3>

We wil be using accuracy, recall, and precision in order to evaluate our models. 

In [78]:
#Calculate how many were right predictions out of the total predicted
def calculate_accuracy(model, actual_tags):
    correct = 0
    total = 0
    for prediction, actual in zip(model, actual_tags):
        total += 1
        if prediction == actual:
            correct += 1 
    accuracy = correct / total 
    return '{0:.1%}'.format(accuracy)
#Keeps track of true positives, true negatives, false positives, and false negatives.
def build_confusion_matrix(actual_tags, model, classOfInterest):
    confusion_matrix = {}
    truePositives = len([p for p, a in zip(model, actual_tags) if p == a and p == classOfInterest])
    trueNegatives = len([p for p, a in zip(model, actual_tags) if p == a and p != classOfInterest])
    falsePositives = len([p for p, a in zip(model, actual_tags) if p != a and p == classOfInterest])
    falseNegatives = len([p for p, a in zip(model, actual_tags) if p != a and p != classOfInterest])
    confusion_matrix["tp"] = truePositives
    confusion_matrix["tn"] = trueNegatives
    confusion_matrix["fp"] = falsePositives 
    confusion_matrix["fn"] = falseNegatives
    return confusion_matrix
#Calculate how many relevant predictions are retrieved in general
def calculate_recall(model, actual_tags, classOfInterest):
    matrix = build_confusion_matrix(model, actual_tags, classOfInterest)
    recall = matrix["tp"] / ( matrix["tp"] + matrix ["fn"])
    return '{0:.1%}'.format(recall)
#Calculate how many retrieved predictions are relevant
def calculate_precision(model, actual_tags, classOfInterest):
    matrix = build_confusion_matrix(model, actual_tags, classOfInterest)
    precision = matrix["tp"]/ (matrix["tp"] + matrix["fp"])
    return '{0:.1%}'.format(precision)

<h3> Baseline Model Threshold </h3>

We now need to decide what similarity value is adequate to determine whether a pair of sentences are semantically equivalent. Are we going to accept them as a paraphrase if the measure is above 0.5? Above 0.8? Instead of randomly picking a threshold for the baseline method, we are going to 'learn' a threshold that yields relatively best results for us in terms of accuracy, recall, and precision.

In [39]:
def apply_cosine_classification(row, **kwargs):
    if(row['Cosine_Similarity'] > kwargs['threshold']):
        classification = 1
    else:
        classification = 0
    return classification

thresholds = [0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9]
quora_data_thresholds = {}
mrp_data_thresholds = {}
accuracies = {}
recalls = {}
precisions = {}
for threshold in thresholds:
    quora_data_thresholds[threshold] = quora_data.apply(apply_cosine_classification, threshold = threshold, axis = 1)
    accuracies[threshold] = calculate_accuracy(quora_data_thresholds[threshold], quora_data['is_Paraphrase'])
    recalls[threshold] = calculate_recall(quora_data_thresholds[threshold], quora_data['is_Paraphrase'], 1)
    precisions[threshold] = calculate_precision(quora_data_thresholds[threshold], quora_data['is_Paraphrase'], 1)
print("Quora Accuracy: {}, \n Quora Recall: {},\nQuora Precision: {}" .format(accuracies, recalls, precisions))

for threshold in thresholds:
    mrp_data_thresholds[threshold] = mrp_data.apply(apply_cosine_classification, threshold = threshold, axis = 1)
    accuracies[threshold] = calculate_accuracy(mrp_data_thresholds[threshold], mrp_data['is_Paraphrase'])
    recalls[threshold] = calculate_recall(mrp_data_thresholds[threshold], mrp_data['is_Paraphrase'],1)
    precisions[threshold] = calculate_precision(mrp_data_thresholds[threshold], mrp_data['is_Paraphrase'], 1)
print("MRP Accuracy: {}, \nMRP Recall: {},\nMRP Precision: {}" .format(accuracies, recalls, precisions))

Quora Accuracy: {0.1: '49.7%', 0.15: '54.0%', 0.2: '58.9%', 0.25: '62.6%', 0.3: '65.4%', 0.35: '67.0%', 0.4: '67.5%', 0.45: '67.8%', 0.5: '67.6%', 0.55: '66.4%', 0.6: '65.8%', 0.65: '66.0%', 0.7: '65.6%', 0.75: '65.5%', 0.8: '65.4%', 0.85: '65.2%', 0.9: '65.4%'}, 
 Quora Recall: {0.1: '43.1%', 0.15: '45.3%', 0.2: '48.0%', 0.25: '50.4%', 0.3: '52.7%', 0.35: '54.5%', 0.4: '55.4%', 0.45: '56.6%', 0.5: '56.7%', 0.55: '56.6%', 0.6: '56.6%', 0.65: '58.2%', 0.7: '58.7%', 0.75: '61.4%', 0.8: '66.2%', 0.85: '71.6%', 0.9: '75.9%'},
Quora Precision: {0.1: '100.0%', 0.15: '99.6%', 0.2: '98.7%', 0.25: '95.1%', 0.3: '88.3%', 0.35: '80.3%', 0.4: '75.4%', 0.45: '66.5%', 0.5: '62.4%', 0.55: '49.5%', 0.6: '42.8%', 0.65: '37.4%', 0.7: '32.1%', 0.75: '24.8%', 0.8: '18.5%', 0.85: '13.9%', 0.9: '13.0%'}
MRP Accuracy: {0.1: '67.4%', 0.15: '67.9%', 0.2: '68.7%', 0.25: '69.1%', 0.3: '69.6%', 0.35: '70.2%', 0.4: '70.9%', 0.45: '69.5%', 0.5: '67.1%', 0.55: '63.8%', 0.6: '59.3%', 0.65: '53.8%', 0.7: '49.3%', 0.75

<h4> Explain why we picked 0.4 </h4>

In [45]:
quora_baseline = {}
quora_baseline['Cosine_Classification'] = quora_data.apply(apply_cosine_classification, threshold=0.4, axis=1)
quora_baseline_df = pd.DataFrame(quora_baseline)
print(quora_baseline_df)

      Cosine_Classification
0                         1
1                         1
2                         0
3                         0
4                         0
...                     ...
4495                      1
4496                      1
4497                      0
4498                      1
4499                      1

[4500 rows x 1 columns]


In [46]:
mrp_baseline = {}
mrp_baseline['Cosine_Classification'] = mrp_data.apply(apply_cosine_classification, threshold = 0.4,  axis=1)
mrp_baseline_df = pd.DataFrame(mrp_baseline)
print(mrp_baseline_df)

      Cosine_Classification
0                         1
1                         0
2                         1
3                         0
4                         0
...                     ...
3944                      1
3945                      1
3946                      1
3947                      0
3948                      1

[3949 rows x 1 columns]


<h3> Measures and Results </h3>

In [81]:
def baseline_evaluation_summary(baseline, df):
    baseline_accuracy = calculate_accuracy(baseline['Cosine_Classification'], df['is_Paraphrase'])
    baseline_recall = calculate_recall(baseline['Cosine_Classification'], df['is_Paraphrase'], 1)
    baseline_precision = calculate_precision(baseline['Cosine_Classification'], df['is_Paraphrase'], 1)
    
    baseline_evaluation_summary = {"Model": ['Baseline'],
                   "Accuracy":[(baseline_accuracy)], 
                   "Recall":[baseline_recall], 
                   "Precision":[baseline_precision]}
    results_df = pd.DataFrame(baseline_evaluation_summary)
    print(results_df)

In [82]:
print('Quora Data Baseline Evaluation')
baseline_evaluation_summary(quora_baseline, quora_data)
print('Microsoft Data Baseline Evaluation')
baseline_evaluation_summary(mrp_baseline, mrp_data)

Quora Data Baseline Evaluation
      Model Accuracy Recall Precision
0  Baseline    67.5%  55.4%     75.4%
Microsoft Data Baseline Evaluation
      Model Accuracy Recall Precision
0  Baseline    70.9%  76.2%     82.8%


<h2>Feature Engineering</h2>

<h3>Syntactic Similarity</h3>

<h4>Edit Distance</h4>

Edit distance is a measure of similarity between two strings, source and target. It is the minimum number of operations (insertion, deletion, or substitution) required to transform the source string to target string. For example, the edit distance between 'monkey' and 'money' is 1. Deletion of 'k' in 'monkey' will give us 'money'.
Here, we have implemented such edit distance but on the word level instead of character level. This was done by transforming the sentences to list and then comparing each element of the source and the target. 

In [53]:
def edit_distance(row):
    sentence_one_tokenize = tokenize(row['Sentence_1'])
    sentence_two_tokenize = tokenize(row['Sentence_2'])
    return nltk.edit_distance(sentence_one_tokenize, sentence_two_tokenize)

In [54]:
#Applying Edit Distance to Quora sentence pairs 
quora_data['Edit_distance'] = quora_data.apply(edit_distance, axis=1)
quora_data.head(20)

Unnamed: 0,Sentence_1,Sentence_2,is_Paraphrase,Cosine_Similarity,Edit_distance
0,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0,0.895532,3
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0,0.410995,9
2,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0,0.225765,11
3,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0,0.0,11
4,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0,0.168368,12
5,Astrology: I am a Capricorn Sun Cap moon and c...,"I'm a triple Capricorn (Sun, Moon and ascendan...",1,0.422795,11
6,Should I buy tiago?,What keeps childern active and far from phone ...,0,0.0,11
7,How can I be a good geologist?,What should I do to be a great geologist?,1,0.336097,5
8,When do you use シ instead of し?,"When do you use ""&"" instead of ""and""?",0,0.709297,2
9,Motorola (company): Can I hack my Charter Moto...,How do I hack Motorola DCX3400 for free internet?,0,0.380873,8


In [55]:
#Applying Edit Distance to MRP Corpus sentence pairs 
mrp_data['Edit_distance'] = mrp_data.apply(edit_distance, axis=1)
mrp_data.head(20)

Unnamed: 0,Sentence_1,Sentence_2,is_Paraphrase,Cosine_Similarity,Edit_distance
0,"Amrozi accused his brother, whom he called ""th...","Referring to him as only ""the witness"", Amrozi...",1,0.801978,11
1,Yucaipa owned Dominick's before selling the ch...,Yucaipa bought Dominick's in 1995 for $693 mil...,0,0.339099,14
2,They had published an advertisement on the Int...,"On June 10, the ship's owners had published an...",1,0.588364,14
3,"Around 0335 GMT, Tab shares were up 19 cents, ...","Tab shares jumped 20 cents, or 4.6%, to set a ...",0,0.397346,13
4,"The stock rose $2.11, or about 11 percent, to ...",PG&E Corp. shares jumped $1.63 or 8 percent to...,1,0.381322,15
5,Revenue in the first quarter of the year dropp...,With the scandal hanging over Stewart's compan...,1,0.776515,8
6,"The Nasdaq had a weekly gain of 17.27, or 1.2 ...",The tech-laced Nasdaq Composite .IXIC rallied ...,0,0.179523,13
7,The DVD-CCA then appealed to the state Supreme...,The DVD CCA appealed that decision to the U.S....,1,0.716812,5
8,"That compared with $35.18 million, or 24 cents...",Earnings were affected by a non-recurring $8 m...,0,0.252334,10
9,He said the foodservice pie business doesn't f...,The foodservice pie business does not fit our ...,1,0.81818,6


<h4>Jaccard Similarity Coefficient</h4>

*Explain Jaccard - smoothing with 1* 

In [56]:
def jaccard_sim_coefficient(row):
    sentence_one_tokenize = tokenize(row['Sentence_1'])
    sentence_two_tokenize = tokenize(row['Sentence_2'])
    intersection = 0
    for word_in_one in sentence_one_tokenize:
        for word_in_two in sentence_two_tokenize:
            if word_in_one == word_in_two:
                intersection += 1
    union = (len(sentence_one_tokenize) + len(sentence_two_tokenize) - intersection)
    smoothed_intersection = intersection + 1
    smoothed_union = union + (len(sentence_one_tokenize) + len(sentence_two_tokenize))
    return smoothed_intersection/smoothed_union

In [57]:
#Applying Jaccard Similarity Coefficient to Quora sentence pairs 
quora_data['Jaccard_similarity'] = quora_data.apply(jaccard_sim_coefficient, axis=1)
quora_data.head(20)

Unnamed: 0,Sentence_1,Sentence_2,is_Paraphrase,Cosine_Similarity,Edit_distance,Jaccard_similarity
0,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0,0.895532,3,0.394737
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0,0.410995,9,0.162162
2,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0,0.225765,11,0.113636
3,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0,0.0,11,0.025
4,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0,0.168368,12,0.078947
5,Astrology: I am a Capricorn Sun Cap moon and c...,"I'm a triple Capricorn (Sun, Moon and ascendan...",1,0.422795,11,0.160714
6,Should I buy tiago?,What keeps childern active and far from phone ...,0,0.0,11,0.033333
7,How can I be a good geologist?,What should I do to be a great geologist?,1,0.336097,5,0.178571
8,When do you use シ instead of し?,"When do you use ""&"" instead of ""and""?",0,0.709297,2,0.269231
9,Motorola (company): Can I hack my Charter Moto...,How do I hack Motorola DCX3400 for free internet?,0,0.380873,8,0.121212


In [58]:
#Applying Jaccard Similarity Coefficient to MRP Corpus sentence pairs 
mrp_data['Jaccard_similarity'] = mrp_data.apply(jaccard_sim_coefficient, axis=1)
mrp_data.head(20)

Unnamed: 0,Sentence_1,Sentence_2,is_Paraphrase,Cosine_Similarity,Edit_distance,Jaccard_similarity
0,"Amrozi accused his brother, whom he called ""th...","Referring to him as only ""the witness"", Amrozi...",1,0.801978,11,0.270833
1,Yucaipa owned Dominick's before selling the ch...,Yucaipa bought Dominick's in 1995 for $693 mil...,0,0.339099,14,0.160714
2,They had published an advertisement on the Int...,"On June 10, the ship's owners had published an...",1,0.588364,14,0.351852
3,"Around 0335 GMT, Tab shares were up 19 cents, ...","Tab shares jumped 20 cents, or 4.6%, to set a ...",0,0.397346,13,0.177419
4,"The stock rose $2.11, or about 11 percent, to ...",PG&E Corp. shares jumped $1.63 or 8 percent to...,1,0.381322,15,0.171875
5,Revenue in the first quarter of the year dropp...,With the scandal hanging over Stewart's compan...,1,0.776515,8,0.517857
6,"The Nasdaq had a weekly gain of 17.27, or 1.2 ...",The tech-laced Nasdaq Composite .IXIC rallied ...,0,0.179523,13,0.092593
7,The DVD-CCA then appealed to the state Supreme...,The DVD CCA appealed that decision to the U.S....,1,0.716812,5,0.28125
8,"That compared with $35.18 million, or 24 cents...",Earnings were affected by a non-recurring $8 m...,0,0.252334,10,0.096154
9,He said the foodservice pie business doesn't f...,The foodservice pie business does not fit our ...,1,0.81818,6,0.25641


<h4>Sequence Matcher</h4>

*Explain Sequence Matcher*

In [59]:
def sequence_matcher(row):
    return SequenceMatcher(None, row['Sentence_1'], row['Sentence_2']).ratio()

In [60]:
#Applying Sequence Matcher to Quora sentence pairs 
quora_data['Sequence_matcher'] = quora_data.apply(sequence_matcher, axis=1)
quora_data.head(20)

Unnamed: 0,Sentence_1,Sentence_2,is_Paraphrase,Cosine_Similarity,Edit_distance,Jaccard_similarity,Sequence_matcher
0,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0,0.895532,3,0.394737,0.926829
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0,0.410995,9,0.162162,0.647482
2,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0,0.225765,11,0.113636,0.454545
3,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0,0.0,11,0.025,0.069565
4,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0,0.168368,12,0.078947,0.365217
5,Astrology: I am a Capricorn Sun Cap moon and c...,"I'm a triple Capricorn (Sun, Moon and ascendan...",1,0.422795,11,0.160714,0.659091
6,Should I buy tiago?,What keeps childern active and far from phone ...,0,0.0,11,0.033333,0.17284
7,How can I be a good geologist?,What should I do to be a great geologist?,1,0.336097,5,0.178571,0.591549
8,When do you use シ instead of し?,"When do you use ""&"" instead of ""and""?",0,0.709297,2,0.269231,0.852941
9,Motorola (company): Can I hack my Charter Moto...,How do I hack Motorola DCX3400 for free internet?,0,0.380873,8,0.121212,0.495413


In [61]:
#Applying Sequence Matcher to MRP Corpus sentence pairs 
mrp_data['Sequence_matcher'] = mrp_data.apply(sequence_matcher, axis=1)
mrp_data.head(20)

Unnamed: 0,Sentence_1,Sentence_2,is_Paraphrase,Cosine_Similarity,Edit_distance,Jaccard_similarity,Sequence_matcher
0,"Amrozi accused his brother, whom he called ""th...","Referring to him as only ""the witness"", Amrozi...",1,0.801978,11,0.270833,0.653659
1,Yucaipa owned Dominick's before selling the ch...,Yucaipa bought Dominick's in 1995 for $693 mil...,0,0.339099,14,0.160714,0.627027
2,They had published an advertisement on the Int...,"On June 10, the ship's owners had published an...",1,0.588364,14,0.351852,0.704225
3,"Around 0335 GMT, Tab shares were up 19 cents, ...","Tab shares jumped 20 cents, or 4.6%, to set a ...",0,0.397346,13,0.177419,0.616216
4,"The stock rose $2.11, or about 11 percent, to ...",PG&E Corp. shares jumped $1.63 or 8 percent to...,1,0.381322,15,0.171875,0.605128
5,Revenue in the first quarter of the year dropp...,With the scandal hanging over Stewart's compan...,1,0.776515,8,0.517857,0.773109
6,"The Nasdaq had a weekly gain of 17.27, or 1.2 ...",The tech-laced Nasdaq Composite .IXIC rallied ...,0,0.179523,13,0.092593,0.505747
7,The DVD-CCA then appealed to the state Supreme...,The DVD CCA appealed that decision to the U.S....,1,0.716812,5,0.28125,0.736842
8,"That compared with $35.18 million, or 24 cents...",Earnings were affected by a non-recurring $8 m...,0,0.252334,10,0.096154,0.544379
9,He said the foodservice pie business doesn't f...,The foodservice pie business does not fit our ...,1,0.81818,6,0.25641,0.832298


<h4>N-gram measure</h4>

*Explain N-gram measure, n=3*

In [62]:
def ngram_measure(row):
    n = 3
    common_count = 1
    grams_sentence_one = ngrams(row['Sentence_1'].split(), n)
    grams_sentence_two = ngrams(row['Sentence_2'].split(), n)
    grams_sentence_one_total = sum(1 for x in grams_sentence_one)
    grams_sentence_two_total = sum(1 for x in grams_sentence_two)
    for gram_in_one in grams_sentence_one:
        if gram_in_one in grams_sentence_two:
            common_count += 1
    return common_count / (grams_sentence_one_total + grams_sentence_two_total - common_count + grams_sentence_one_total + grams_sentence_two_total)

In [63]:
#Applying N-gram Measure to Quora sentence pairs 
quora_data['N-gram_measure'] = quora_data.apply(ngram_measure, axis=1)
quora_data.head(20)

Unnamed: 0,Sentence_1,Sentence_2,is_Paraphrase,Cosine_Similarity,Edit_distance,Jaccard_similarity,Sequence_matcher,N-gram_measure
0,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0,0.895532,3,0.394737,0.926829,0.023256
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0,0.410995,9,0.162162,0.647482,0.030303
2,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0,0.225765,11,0.113636,0.454545,0.025641
3,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0,0.0,11,0.025,0.069565,0.032258
4,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0,0.168368,12,0.078947,0.365217,0.032258
5,Astrology: I am a Capricorn Sun Cap moon and c...,"I'm a triple Capricorn (Sun, Moon and ascendan...",1,0.422795,11,0.160714,0.659091,0.018182
6,Should I buy tiago?,What keeps childern active and far from phone ...,0,0.0,11,0.033333,0.17284,0.047619
7,How can I be a good geologist?,What should I do to be a great geologist?,1,0.336097,5,0.178571,0.591549,0.043478
8,When do you use シ instead of し?,"When do you use ""&"" instead of ""and""?",0,0.709297,2,0.269231,0.852941,0.043478
9,Motorola (company): Can I hack my Charter Moto...,How do I hack Motorola DCX3400 for free internet?,0,0.380873,8,0.121212,0.495413,0.037037


In [64]:
#Applying N-gram Measure to MRP Corpus sentence pairs 
mrp_data['N-gram_measure'] = mrp_data.apply(ngram_measure, axis=1)
mrp_data.head(20)

Unnamed: 0,Sentence_1,Sentence_2,is_Paraphrase,Cosine_Similarity,Edit_distance,Jaccard_similarity,Sequence_matcher,N-gram_measure
0,"Amrozi accused his brother, whom he called ""th...","Referring to him as only ""the witness"", Amrozi...",1,0.801978,11,0.270833,0.653659,0.019608
1,Yucaipa owned Dominick's before selling the ch...,Yucaipa bought Dominick's in 1995 for $693 mil...,0,0.339099,14,0.160714,0.627027,0.018182
2,They had published an advertisement on the Int...,"On June 10, the ship's owners had published an...",1,0.588364,14,0.351852,0.704225,0.015873
3,"Around 0335 GMT, Tab shares were up 19 cents, ...","Tab shares jumped 20 cents, or 4.6%, to set a ...",0,0.397346,13,0.177419,0.616216,0.015873
4,"The stock rose $2.11, or about 11 percent, to ...",PG&E Corp. shares jumped $1.63 or 8 percent to...,1,0.381322,15,0.171875,0.605128,0.015385
5,Revenue in the first quarter of the year dropp...,With the scandal hanging over Stewart's compan...,1,0.776515,8,0.517857,0.773109,0.013333
6,"The Nasdaq had a weekly gain of 17.27, or 1.2 ...",The tech-laced Nasdaq Composite .IXIC rallied ...,0,0.179523,13,0.092593,0.505747,0.020408
7,The DVD-CCA then appealed to the state Supreme...,The DVD CCA appealed that decision to the U.S....,1,0.716812,5,0.28125,0.736842,0.032258
8,"That compared with $35.18 million, or 24 cents...",Earnings were affected by a non-recurring $8 m...,0,0.252334,10,0.096154,0.544379,0.021277
9,He said the foodservice pie business doesn't f...,The foodservice pie business does not fit our ...,1,0.81818,6,0.25641,0.832298,0.025641


<h3>Semantic Similarity</h3>

<h4> Word Mover's Distance </h4>


*Explain WMD here*

In [65]:
def word_movers_distance(row):
    sentence_one_tokenize = tokenize(row['Sentence_1'])
    sentence_two_tokenize = tokenize(row['Sentence_2'])
    filtered_sentence_one = remove_stop_words(sentence_one_tokenize)
    filtered_sentence_two = remove_stop_words(sentence_two_tokenize)
    distance = word_vectors.wmdistance(filtered_sentence_one, filtered_sentence_two)
    return distance

In [66]:
#Applying Word Mover's Distance to Quora sentence pairs 
quora_data['WMD_distance'] = quora_data.apply(word_movers_distance, axis=1)
quora_data.head(20)

Unnamed: 0,Sentence_1,Sentence_2,is_Paraphrase,Cosine_Similarity,Edit_distance,Jaccard_similarity,Sequence_matcher,N-gram_measure,WMD_distance
0,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0,0.895532,3,0.394737,0.926829,0.023256,0.960645
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0,0.410995,9,0.162162,0.647482,0.030303,4.814004
2,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0,0.225765,11,0.113636,0.454545,0.025641,3.439465
3,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0,0.0,11,0.025,0.069565,0.032258,5.829829
4,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0,0.168368,12,0.078947,0.365217,0.032258,5.070934
5,Astrology: I am a Capricorn Sun Cap moon and c...,"I'm a triple Capricorn (Sun, Moon and ascendan...",1,0.422795,11,0.160714,0.659091,0.018182,2.846745
6,Should I buy tiago?,What keeps childern active and far from phone ...,0,0.0,11,0.033333,0.17284,0.047619,6.943961
7,How can I be a good geologist?,What should I do to be a great geologist?,1,0.336097,5,0.178571,0.591549,0.043478,1.980018
8,When do you use シ instead of し?,"When do you use ""&"" instead of ""and""?",0,0.709297,2,0.269231,0.852941,0.043478,0.0
9,Motorola (company): Can I hack my Charter Moto...,How do I hack Motorola DCX3400 for free internet?,0,0.380873,8,0.121212,0.495413,0.037037,3.169096


In [67]:
#Applying Word Mover's Distance to MRP Corpus sentence pairs 
mrp_data['WMD_distance'] = mrp_data.apply(word_movers_distance, axis=1)
mrp_data.head(20)

Unnamed: 0,Sentence_1,Sentence_2,is_Paraphrase,Cosine_Similarity,Edit_distance,Jaccard_similarity,Sequence_matcher,N-gram_measure,WMD_distance
0,"Amrozi accused his brother, whom he called ""th...","Referring to him as only ""the witness"", Amrozi...",1,0.801978,11,0.270833,0.653659,0.019608,0.491286
1,Yucaipa owned Dominick's before selling the ch...,Yucaipa bought Dominick's in 1995 for $693 mil...,0,0.339099,14,0.160714,0.627027,0.018182,2.723919
2,They had published an advertisement on the Int...,"On June 10, the ship's owners had published an...",1,0.588364,14,0.351852,0.704225,0.015873,1.386057
3,"Around 0335 GMT, Tab shares were up 19 cents, ...","Tab shares jumped 20 cents, or 4.6%, to set a ...",0,0.397346,13,0.177419,0.616216,0.015873,2.483421
4,"The stock rose $2.11, or about 11 percent, to ...",PG&E Corp. shares jumped $1.63 or 8 percent to...,1,0.381322,15,0.171875,0.605128,0.015385,2.149025
5,Revenue in the first quarter of the year dropp...,With the scandal hanging over Stewart's compan...,1,0.776515,8,0.517857,0.773109,0.013333,1.475687
6,"The Nasdaq had a weekly gain of 17.27, or 1.2 ...",The tech-laced Nasdaq Composite .IXIC rallied ...,0,0.179523,13,0.092593,0.505747,0.020408,4.475639
7,The DVD-CCA then appealed to the state Supreme...,The DVD CCA appealed that decision to the U.S....,1,0.716812,5,0.28125,0.736842,0.032258,3.584339
8,"That compared with $35.18 million, or 24 cents...",Earnings were affected by a non-recurring $8 m...,0,0.252334,10,0.096154,0.544379,0.021277,4.28246
9,He said the foodservice pie business doesn't f...,The foodservice pie business does not fit our ...,1,0.81818,6,0.25641,0.832298,0.025641,0.907146


<h4>Named Entity Recognition Similarity</h4>

*Explain NER with jaccard*

In [68]:
def ner_measure(row):
    ner_sentence_one=[]
    ner_sentence_two=[]
    count_common_ner = 0
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(row['Sentence_1']))):
        if hasattr(chunk, 'label'):
            ner_sentence_one.append(chunk)
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(row['Sentence_2']))):
        if hasattr(chunk, 'label'):
            ner_sentence_two.append(chunk)
    for item in ner_sentence_one:
        if item in ner_sentence_two:
            count_common_ner += 1
    smoothed_intersection = count_common_ner + 1
    union = len(ner_sentence_one) + len(ner_sentence_two) - count_common_ner
    smoothed_union = union + len(ner_sentence_one) + len(ner_sentence_two)
    # smoothed union set as 1 if both sentences have no NER
    if smoothed_union == 0:
        smoothed_union = 1
    return smoothed_intersection / smoothed_union

In [69]:
#Applying NER Measure to Quora sentence pairs 
quora_data['NER_similarity'] = quora_data.apply(ner_measure, axis=1)
quora_data.head(20)

Unnamed: 0,Sentence_1,Sentence_2,is_Paraphrase,Cosine_Similarity,Edit_distance,Jaccard_similarity,Sequence_matcher,N-gram_measure,WMD_distance,NER_similarity
0,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0,0.895532,3,0.394737,0.926829,0.023256,0.960645,1.0
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0,0.410995,9,0.162162,0.647482,0.030303,4.814004,0.285714
2,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0,0.225765,11,0.113636,0.454545,0.025641,3.439465,0.25
3,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0,0.0,11,0.025,0.069565,0.032258,5.829829,1.0
4,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0,0.168368,12,0.078947,0.365217,0.032258,5.070934,0.5
5,Astrology: I am a Capricorn Sun Cap moon and c...,"I'm a triple Capricorn (Sun, Moon and ascendan...",1,0.422795,11,0.160714,0.659091,0.018182,2.846745,0.125
6,Should I buy tiago?,What keeps childern active and far from phone ...,0,0.0,11,0.033333,0.17284,0.047619,6.943961,1.0
7,How can I be a good geologist?,What should I do to be a great geologist?,1,0.336097,5,0.178571,0.591549,0.043478,1.980018,1.0
8,When do you use シ instead of し?,"When do you use ""&"" instead of ""and""?",0,0.709297,2,0.269231,0.852941,0.043478,0.0,1.0
9,Motorola (company): Can I hack my Charter Moto...,How do I hack Motorola DCX3400 for free internet?,0,0.380873,8,0.121212,0.495413,0.037037,3.169096,0.25


In [70]:
#Applying NER Measure to MRP Corpus sentence pairs 
mrp_data['NER_similarity'] = mrp_data.apply(ner_measure, axis=1)
mrp_data.head(20)

Unnamed: 0,Sentence_1,Sentence_2,is_Paraphrase,Cosine_Similarity,Edit_distance,Jaccard_similarity,Sequence_matcher,N-gram_measure,WMD_distance,NER_similarity
0,"Amrozi accused his brother, whom he called ""th...","Referring to him as only ""the witness"", Amrozi...",1,0.801978,11,0.270833,0.653659,0.019608,0.491286,0.666667
1,Yucaipa owned Dominick's before selling the ch...,Yucaipa bought Dominick's in 1995 for $693 mil...,0,0.339099,14,0.160714,0.627027,0.018182,2.723919,0.3
2,They had published an advertisement on the Int...,"On June 10, the ship's owners had published an...",1,0.588364,14,0.351852,0.704225,0.015873,1.386057,1.0
3,"Around 0335 GMT, Tab shares were up 19 cents, ...","Tab shares jumped 20 cents, or 4.6%, to set a ...",0,0.397346,13,0.177419,0.616216,0.015873,2.483421,0.25
4,"The stock rose $2.11, or about 11 percent, to ...",PG&E Corp. shares jumped $1.63 or 8 percent to...,1,0.381322,15,0.171875,0.605128,0.015385,2.149025,0.666667
5,Revenue in the first quarter of the year dropp...,With the scandal hanging over Stewart's compan...,1,0.776515,8,0.517857,0.773109,0.013333,1.475687,0.25
6,"The Nasdaq had a weekly gain of 17.27, or 1.2 ...",The tech-laced Nasdaq Composite .IXIC rallied ...,0,0.179523,13,0.092593,0.505747,0.020408,4.475639,0.25
7,The DVD-CCA then appealed to the state Supreme...,The DVD CCA appealed that decision to the U.S....,1,0.716812,5,0.28125,0.736842,0.032258,3.584339,0.285714
8,"That compared with $35.18 million, or 24 cents...",Earnings were affected by a non-recurring $8 m...,0,0.252334,10,0.096154,0.544379,0.021277,4.28246,1.0
9,He said the foodservice pie business doesn't f...,The foodservice pie business does not fit our ...,1,0.81818,6,0.25641,0.832298,0.025641,0.907146,1.0


<h4>Word Sense Disambiguation</h4>

*Explain WSD using Lesk algorithm*

In [71]:
def wsd(row):
    sentence_one_senses = []
    sentence_two_senses = []
    common_senses = 0
    sentence_one_tokenize = tokenize(row['Sentence_1'])
    sentence_two_tokenize = tokenize(row['Sentence_2'])
    for word in sentence_one_tokenize:
        sentence_one_senses.append(lesk(row['Sentence_1'], word))
    for word in sentence_two_tokenize:
        sentence_two_senses.append(lesk(['Sentenece_2'], word))
    sentence_one_senses = (set(sentence_one_senses))
    sentence_two_senses = (set(sentence_two_senses))

    for sense in sentence_one_senses:
        if sense in sentence_two_senses:
            common_senses += 1
    return common_senses / (len(sentence_one_senses) + len(sentence_two_senses) - common_senses)

In [72]:
#Applying WSD Measure to Quora sentence pairs 
quora_data['WSD'] = quora_data.apply(wsd, axis=1)
quora_data.head(20)

Unnamed: 0,Sentence_1,Sentence_2,is_Paraphrase,Cosine_Similarity,Edit_distance,Jaccard_similarity,Sequence_matcher,N-gram_measure,WMD_distance,NER_similarity,WSD
0,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0,0.895532,3,0.394737,0.926829,0.023256,0.960645,1.0,0.416667
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0,0.410995,9,0.162162,0.647482,0.030303,4.814004,0.285714,0.125
2,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0,0.225765,11,0.113636,0.454545,0.025641,3.439465,0.25,0.1875
3,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0,0.0,11,0.025,0.069565,0.032258,5.829829,1.0,0.076923
4,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0,0.168368,12,0.078947,0.365217,0.032258,5.070934,0.5,0.2
5,Astrology: I am a Capricorn Sun Cap moon and c...,"I'm a triple Capricorn (Sun, Moon and ascendan...",1,0.422795,11,0.160714,0.659091,0.018182,2.846745,0.125,0.166667
6,Should I buy tiago?,What keeps childern active and far from phone ...,0,0.0,11,0.033333,0.17284,0.047619,6.943961,1.0,0.125
7,How can I be a good geologist?,What should I do to be a great geologist?,1,0.336097,5,0.178571,0.591549,0.043478,1.980018,1.0,0.2
8,When do you use シ instead of し?,"When do you use ""&"" instead of ""and""?",0,0.709297,2,0.269231,0.852941,0.043478,0.0,1.0,0.333333
9,Motorola (company): Can I hack my Charter Moto...,How do I hack Motorola DCX3400 for free internet?,0,0.380873,8,0.121212,0.495413,0.037037,3.169096,0.25,0.428571


In [73]:
#Applying WSD Measure to MRP Corpus sentence pairs 
mrp_data['WSD'] = mrp_data.apply(wsd, axis=1)
mrp_data.head(20)

Unnamed: 0,Sentence_1,Sentence_2,is_Paraphrase,Cosine_Similarity,Edit_distance,Jaccard_similarity,Sequence_matcher,N-gram_measure,WMD_distance,NER_similarity,WSD
0,"Amrozi accused his brother, whom he called ""th...","Referring to him as only ""the witness"", Amrozi...",1,0.801978,11,0.270833,0.653659,0.019608,0.491286,0.666667,0.4
1,Yucaipa owned Dominick's before selling the ch...,Yucaipa bought Dominick's in 1995 for $693 mil...,0,0.339099,14,0.160714,0.627027,0.018182,2.723919,0.3,0.181818
2,They had published an advertisement on the Int...,"On June 10, the ship's owners had published an...",1,0.588364,14,0.351852,0.704225,0.015873,1.386057,1.0,0.4
3,"Around 0335 GMT, Tab shares were up 19 cents, ...","Tab shares jumped 20 cents, or 4.6%, to set a ...",0,0.397346,13,0.177419,0.616216,0.015873,2.483421,0.25,0.173913
4,"The stock rose $2.11, or about 11 percent, to ...",PG&E Corp. shares jumped $1.63 or 8 percent to...,1,0.381322,15,0.171875,0.605128,0.015385,2.149025,0.666667,0.277778
5,Revenue in the first quarter of the year dropp...,With the scandal hanging over Stewart's compan...,1,0.776515,8,0.517857,0.773109,0.013333,1.475687,0.25,0.3
6,"The Nasdaq had a weekly gain of 17.27, or 1.2 ...",The tech-laced Nasdaq Composite .IXIC rallied ...,0,0.179523,13,0.092593,0.505747,0.020408,4.475639,0.25,0.25
7,The DVD-CCA then appealed to the state Supreme...,The DVD CCA appealed that decision to the U.S....,1,0.716812,5,0.28125,0.736842,0.032258,3.584339,0.285714,0.222222
8,"That compared with $35.18 million, or 24 cents...",Earnings were affected by a non-recurring $8 m...,0,0.252334,10,0.096154,0.544379,0.021277,4.28246,1.0,0.142857
9,He said the foodservice pie business doesn't f...,The foodservice pie business does not fit our ...,1,0.81818,6,0.25641,0.832298,0.025641,0.907146,1.0,0.333333


<h4> Wordnet Extended Cosine Similarity </h4>
Here, we will use WordNet's capabilities to extend our sentences and, therefore, extend our dictionary. We will then call our existing calculate_cosine_similarity method to get the similarity of the extended documents.

In [74]:
def extend_sentence_wordnet(row):
    sentence_one_tokenize = tokenize_negation(row['Sentence_1'])
    sentence_two_tokenize = tokenize_negation(row['Sentence_2'])
    filtered_sentence_one = remove_stop_words(sentence_one_tokenize)
    filtered_sentence_two = remove_stop_words(sentence_two_tokenize)
    lemmatize_sentence_one = lemmatize(filtered_sentence_one)
    lemmatize_sentence_two = lemmatize(filtered_sentence_two)
    # get extended synonym list    
    extended_dictionary_one = []
    extended_dictionary_two = []
    for one, two in zip(lemmatize_sentence_one, lemmatize_sentence_two):
        synonym_one = add_synonym(one)
        synonym_two = add_synonym(two)
        hypernym_one = add_hypernym(one)
        hypernym_two = add_hypernym(two)
        if(synonym_one):
            extended_dictionary_one += synonym_one
        if(hypernym_one):
            extended_dictionary_one += hypernym_one
        if(synonym_two):
            extended_dictionary_two += synonym_two
        if(hypernym_two):
            extended_dictionary_two += hypernym_two
    lemmatize_sentence_one += extended_dictionary_one
    lemmatize_sentence_two += extended_dictionary_two
    
    #calculate similarity based on the extended list 
    similarity = calculate_cosine_similarity(lemmatize_sentence_one, lemmatize_sentence_two)
    return similarity

In [75]:
#Applying Synonym_Cosine to Quora sentence pairs 
quora_data['Synonym_Hypernym_Cosine'] = quora_data.apply(extend_sentence_wordnet, axis=1)
quora_data.head(20)

Unnamed: 0,Sentence_1,Sentence_2,is_Paraphrase,Cosine_Similarity,Edit_distance,Jaccard_similarity,Sequence_matcher,N-gram_measure,WMD_distance,NER_similarity,WSD,Synonym_Hypernym_Cosine
0,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0,0.895532,3,0.394737,0.926829,0.023256,0.960645,1.0,0.416667,0.996588
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0,0.410995,9,0.162162,0.647482,0.030303,4.814004,0.285714,0.125,0.050784
2,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0,0.225765,11,0.113636,0.454545,0.025641,3.439465,0.25,0.1875,0.640131
3,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0,0.0,11,0.025,0.069565,0.032258,5.829829,1.0,0.076923,0.047164
4,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0,0.168368,12,0.078947,0.365217,0.032258,5.070934,0.5,0.2,0.239692
5,Astrology: I am a Capricorn Sun Cap moon and c...,"I'm a triple Capricorn (Sun, Moon and ascendan...",1,0.422795,11,0.160714,0.659091,0.018182,2.846745,0.125,0.166667,0.367553
6,Should I buy tiago?,What keeps childern active and far from phone ...,0,0.0,11,0.033333,0.17284,0.047619,6.943961,1.0,0.125,0.0
7,How can I be a good geologist?,What should I do to be a great geologist?,1,0.336097,5,0.178571,0.591549,0.043478,1.980018,1.0,0.2,0.047231
8,When do you use シ instead of し?,"When do you use ""&"" instead of ""and""?",0,0.709297,2,0.269231,0.852941,0.043478,0.0,1.0,0.333333,0.989762
9,Motorola (company): Can I hack my Charter Moto...,How do I hack Motorola DCX3400 for free internet?,0,0.380873,8,0.121212,0.495413,0.037037,3.169096,0.25,0.428571,0.547755


In [76]:
mrp_data['Synonym_Hypernym_Cosine'] = mrp_data.apply(extend_sentence_wordnet, axis = 1)
mrp_data.head(20)

Unnamed: 0,Sentence_1,Sentence_2,is_Paraphrase,Cosine_Similarity,Edit_distance,Jaccard_similarity,Sequence_matcher,N-gram_measure,WMD_distance,NER_similarity,WSD,Synonym_Hypernym_Cosine
0,"Amrozi accused his brother, whom he called ""th...","Referring to him as only ""the witness"", Amrozi...",1,0.801978,11,0.270833,0.653659,0.019608,0.491286,0.666667,0.4,0.543801
1,Yucaipa owned Dominick's before selling the ch...,Yucaipa bought Dominick's in 1995 for $693 mil...,0,0.339099,14,0.160714,0.627027,0.018182,2.723919,0.3,0.181818,0.241053
2,They had published an advertisement on the Int...,"On June 10, the ship's owners had published an...",1,0.588364,14,0.351852,0.704225,0.015873,1.386057,1.0,0.4,0.577762
3,"Around 0335 GMT, Tab shares were up 19 cents, ...","Tab shares jumped 20 cents, or 4.6%, to set a ...",0,0.397346,13,0.177419,0.616216,0.015873,2.483421,0.25,0.173913,0.467429
4,"The stock rose $2.11, or about 11 percent, to ...",PG&E Corp. shares jumped $1.63 or 8 percent to...,1,0.381322,15,0.171875,0.605128,0.015385,2.149025,0.666667,0.277778,0.541388
5,Revenue in the first quarter of the year dropp...,With the scandal hanging over Stewart's compan...,1,0.776515,8,0.517857,0.773109,0.013333,1.475687,0.25,0.3,0.759142
6,"The Nasdaq had a weekly gain of 17.27, or 1.2 ...",The tech-laced Nasdaq Composite .IXIC rallied ...,0,0.179523,13,0.092593,0.505747,0.020408,4.475639,0.25,0.25,0.117006
7,The DVD-CCA then appealed to the state Supreme...,The DVD CCA appealed that decision to the U.S....,1,0.716812,5,0.28125,0.736842,0.032258,3.584339,0.285714,0.222222,0.140652
8,"That compared with $35.18 million, or 24 cents...",Earnings were affected by a non-recurring $8 m...,0,0.252334,10,0.096154,0.544379,0.021277,4.28246,1.0,0.142857,0.328852
9,He said the foodservice pie business doesn't f...,The foodservice pie business does not fit our ...,1,0.81818,6,0.25641,0.832298,0.025641,0.907146,1.0,0.333333,0.824734


<h2> Neural Network Model </h2>

<h3> Post Processing </h3>

*Explain -- preparing our data for our multilayer perceptron*

In [77]:
max_quora = quora_data.loc[quora_data['WMD_distance'] != np.nan, 'WMD_distance'].max()
quora_data['WMD_distance'].replace(np.nan, max_quora, inplace=True)
x = quora_data[['WMD_distance']].values.astype(float)
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
quora_data['WMD_distance'] = pd.DataFrame(x_scaled)

max_quora = quora_data.loc[quora_data['WMD_distance'] != np.nan, 'WMD_distance'].max()
quora_data['WMD_distance'].replace(np.nan, max_quora, inplace=True)
x = quora_data[['WMD_distance']].values.astype(float)
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
quora_data['WMD_distance'] = pd.DataFrame(x_scaled)

<h3> Multi-layer Perceptron </h3>

In [None]:
def multilayer_perceptron(df):
    properties = list(df.columns.values)
    properties.remove('is_Paraphrase')
    properties.remove('Sentence_1')
    properties.remove('Sentence_2')
    # properties.remove('NER_similarity')
    X = df[properties]
    y = df['is_Paraphrase']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    
    model = keras.Sequential([
    keras.layers.Flatten(input_shape=(9,)),
    keras.layers.Dense(7, activation=tf.nn.relu),
    keras.layers.Dense(7, activation=tf.nn.relu),
    keras.layers.Dense(1, activation=tf.nn.sigmoid),
    ])

    model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy', keras.metrics.Precision(), keras.metrics.Recall()])

    model.fit(X_train, y_train, epochs=15, batch_size=1)

    test_loss, test_acc, test_pre, test_recall = model.evaluate(X_test, y_test)
    print('Test accuracy:{}, test recall: {}, test precision: {}'.format(test_acc, test_recall, test_pre))

In [None]:
print("Multi-Layer Perceptron Results for quora_data")
multilayer_perceptron(quora_data)
print("\n*********************************************************************************\n")
print("Multi-Layer Perceptron Results for mrp_data")
multilayer_perceptron(mrp_data)

<h2> Results & Evaluation </h2>

<h2> Conclusion </h2>

We hypothesized that syntactic and semantic similarity were both important while predicting if a pair of sentences were paraphrase. We built our features based on this hypothesis, creating both syntactic and semantic features. 

<h2> References </h2>