 <h1> Paraphrase Detection with Neural Networks - Natural Language Understanding </h1>


CSI4106 Artificial Intelligence Project <br/>
Prepared by Abha Sharma & Rupsi Kaushik

<h2> Background </h2>

With the growing trends of virtual assistants and chatbots, Natural Language Processing (NLP) is a topic that is becoming increasingly popular in the recent years. From Google AI's Transformer-based models that consider a word's double-sided context to IBM's training data generator, today we have cutting edge approaches to solving NLP tasks.  However, even with these latest breakthroughs, NLP still faces many challenges, namely the problem of accurately deciphering what humans mean when they express something, regardless of how they express it. This problem falls under Natural Language Understanding (NLU), a subtopic of NLP that aims to increase the proficiency of intelligent systems in exhibiting real knowledge of natural language. Within this field, the task of paraphrase detection - determining whether a pair of sentences convey identical meaning - is considered to be an important one. Through the improvement of paraphrase detection, other NLP tasks that are integral to the efficiency of existing intelligent systems, such as question answering, information retrieval, and text summarization, can also be improved. For this reason, in this report, we propose to enhance the capability of neural networks in the context of paraphrase detection through the use of traditional Information Retrieval (IR) techniques as input features. 


*add and edit as you want*

<h2> Objectives </h2>

The main objective of this report is to evaluate the performance of a neural network model given different IR features. Additionally, it will take a look at how the number of features and hidden layers improve the overall performance of the model. These results will be compared among two different training sets that have been annotated for paraphrase detection. Below is the proposed architecture for our particular neural network: 

*adding the diagram & add and edit as you want* 

<h2> Datasets </h2>

We will be working with the Quora Question Pairs and Microsoft Research Paraphrase Corpus datasets for this project. You can find them in this folder labelled as 'msr_train.csv' and 'questions_train.csv'

In [3]:
#Make sure to import all these modules
import pandas as pd 
from pyemd import emd
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk.wsd import lesk
from nltk import ngrams
from difflib import SequenceMatcher
from gensim.models import Word2Vec
import gensim.downloader as api
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
word_vectors = api.load("glove-wiki-gigaword-100")
from sklearn.model_selection import train_test_split

In [28]:
# Taking a look at the Quora Question Pairs dataset
quora_data = pd.read_csv("questions_train.csv", error_bad_lines=False)
quora_data.Sentence_1 = quora_data.Sentence_1.astype(str)
quora_data.Sentence_2 = quora_data.Sentence_2.astype(str)
quora_data = quora_data[:4000]
quora_data.is_Paraphrase = quora_data.is_Paraphrase.astype(int)
quora_data.head(20)

Unnamed: 0,Sentence_1,Sentence_2,is_Paraphrase
0,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0
5,Astrology: I am a Capricorn Sun Cap moon and c...,"I'm a triple Capricorn (Sun, Moon and ascendan...",1
6,Should I buy tiago?,What keeps childern active and far from phone ...,0
7,How can I be a good geologist?,What should I do to be a great geologist?,1
8,When do you use シ instead of し?,"When do you use ""&"" instead of ""and""?",0
9,Motorola (company): Can I hack my Charter Moto...,How do I hack Motorola DCX3400 for free internet?,0


In [5]:
#Taking a look at the Microsoft Research Paraphrase dataset 
mrp_data = pd.read_csv("msr_train.csv")
mrp_data.Sentence_1 = mrp_data.Sentence_1.astype(str)
mrp_data.Sentence_2 = mrp_data.Sentence_2.astype(str)
mrp_data.head(20)

Unnamed: 0,Sentence_1,Sentence_2,is_Paraphrase
0,"Amrozi accused his brother, whom he called ""th...","Referring to him as only ""the witness"", Amrozi...",1
1,Yucaipa owned Dominick's before selling the ch...,Yucaipa bought Dominick's in 1995 for $693 mil...,0
2,They had published an advertisement on the Int...,"On June 10, the ship's owners had published an...",1
3,"Around 0335 GMT, Tab shares were up 19 cents, ...","Tab shares jumped 20 cents, or 4.6%, to set a ...",0
4,"The stock rose $2.11, or about 11 percent, to ...",PG&E Corp. shares jumped $1.63 or 8 percent to...,1
5,Revenue in the first quarter of the year dropp...,With the scandal hanging over Stewart's compan...,1
6,"The Nasdaq had a weekly gain of 17.27, or 1.2 ...",The tech-laced Nasdaq Composite .IXIC rallied ...,0
7,The DVD-CCA then appealed to the state Supreme...,The DVD CCA appealed that decision to the U.S....,1
8,"That compared with $35.18 million, or 24 cents...",Earnings were affected by a non-recurring $8 m...,0
9,He said the foodservice pie business doesn't f...,The foodservice pie business does not fit our ...,1


<h2> Preprocessing & Transformation </h2>

In [58]:
def remove_stop_words(sentence):
    stop_words = stopwords.words('english')
    processed_sentence = [word for word in sentence if not word in stop_words]
    return processed_sentence     
def tokenize(sentence):
    tokenized_sentence = sentence.lower().split()
    return tokenized_sentence
def lemmatize(sentence):
    lemmatizer = WordNetLemmatizer()
    lemmatized_sentence = [lemmatizer.lemmatize(word) for word in sentence]
    return lemmatized_sentence
def add_synonym(word):
    synonym_list = []
    for syn in wordnet.synsets(word):
        for name in syn.lemma_names():
             synonym_list.append(name.split(".")[0].replace('_',' '))
    return list(set(synonym_list))
def add_antonym(word):
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            if(lemma.antonyms()):
                return lemma.antonyms()[0].name()
            else:
                return None            
def tokenize_negation(sentence):
    negation_adverbs = ["no", "without","not", "n't", "never", "neith", "nor"]
    tokens_with_negation = []
    tokenized_sentence = tokenize(sentence)
    i = 0
    while i < (len(tokenized_sentence)):
        if (i != len(tokenized_sentence)-1) and (tokenized_sentence[i] in negation_adverbs):
            negation_token = add_antonym(tokenized_sentence[i+1])
            if(negation_token):
                tokens_with_negation.append(negation_token)
                i += 2 
            else:
                tokens_with_negation.append(tokenized_sentence[i])
                i +=1
        else:
            tokens_with_negation.append(tokenized_sentence[i])
            i += 1
    return tokens_with_negation

<h2>Feature Engineering</h2>

<h3>Syntactic Similarity</h3>

<h4>Edit Distance</h4>

*Explain Edit Distance* 

In [30]:
def edit_distance(row):
    sentence_one_tokenize = tokenize(row['Sentence_1'])
    sentence_two_tokenize = tokenize(row['Sentence_2'])
    return nltk.edit_distance(sentence_one_tokenize, sentence_two_tokenize)

In [31]:
#Applying Edit Distance to Quora sentence pairs 
quora_data['Edit_distance'] = quora_data.apply(edit_distance, axis=1)
quora_data.head(20)

Unnamed: 0,Sentence_1,Sentence_2,is_Paraphrase,Edit_distance
0,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0,3
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0,9
2,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0,11
3,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0,11
4,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0,12
5,Astrology: I am a Capricorn Sun Cap moon and c...,"I'm a triple Capricorn (Sun, Moon and ascendan...",1,11
6,Should I buy tiago?,What keeps childern active and far from phone ...,0,11
7,How can I be a good geologist?,What should I do to be a great geologist?,1,5
8,When do you use シ instead of し?,"When do you use ""&"" instead of ""and""?",0,2
9,Motorola (company): Can I hack my Charter Moto...,How do I hack Motorola DCX3400 for free internet?,0,8


In [32]:
#Applying Edit Distance to MRP Corpus sentence pairs 
mrp_data['Edit_distance'] = mrp_data.apply(edit_distance, axis=1)
mrp_data.head(20)

Unnamed: 0,Sentence_1,Sentence_2,is_Paraphrase,Edit_distance,Jaccard_similarity,Sequence_matcher,N-gram_measure
0,"Amrozi accused his brother, whom he called ""th...","Referring to him as only ""the witness"", Amrozi...",1,11,0.666667,0.653659,0.071429
1,Yucaipa owned Dominick's before selling the ch...,Yucaipa bought Dominick's in 1995 for $693 mil...,0,14,0.333333,0.627027,0.0625
2,They had published an advertisement on the Int...,"On June 10, the ship's owners had published an...",1,14,1.0,0.704225,0.0625
3,"Around 0335 GMT, Tab shares were up 19 cents, ...","Tab shares jumped 20 cents, or 4.6%, to set a ...",0,13,0.384615,0.616216,0.076923
4,"The stock rose $2.11, or about 11 percent, to ...",PG&E Corp. shares jumped $1.63 or 8 percent to...,1,15,0.37037,0.605128,0.0625
5,Revenue in the first quarter of the year dropp...,With the scandal hanging over Stewart's compan...,1,8,2.0,0.773109,0.045455
6,"The Nasdaq had a weekly gain of 17.27, or 1.2 ...",The tech-laced Nasdaq Composite .IXIC rallied ...,0,13,0.16,0.505747,0.090909
7,The DVD-CCA then appealed to the state Supreme...,The DVD CCA appealed that decision to the U.S....,1,5,0.666667,0.736842,0.111111
8,"That compared with $35.18 million, or 24 cents...",Earnings were affected by a non-recurring $8 m...,0,10,0.166667,0.544379,0.083333
9,He said the foodservice pie business doesn't f...,The foodservice pie business does not fit our ...,1,6,0.6,0.832298,0.111111


<h4>Jaccard Similarity Coefficient</h4>

*Explain Jaccard*

In [33]:
def jaccard_sim_coefficient(row):
    sentence_one_tokenize = tokenize(row['Sentence_1'])
    sentence_two_tokenize = tokenize(row['Sentence_2'])
    common_count = 0
    for word_in_one in sentence_one_tokenize:
        for word_in_two in sentence_two_tokenize:
            if word_in_one == word_in_two:
                common_count += 1
    total = (len(sentence_one_tokenize) + len(sentence_two_tokenize) - common_count)
    if total == 0:
        total = 0.001
    return common_count/total

In [9]:
#Applying Jaccard Similarity Coefficient to Quora sentence pairs 
quora_data['Jaccard_similarity'] = quora_data.apply(jaccard_sim_coefficient, axis=1)
quora_data.head(20)

Unnamed: 0,Sentence_1,Sentence_2,is_Paraphrase,Edit_distance,Jaccard_similarity
0,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0,3,1.166667
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0,9,0.3125
2,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0,11,0.2
3,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0,11,0.0
4,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0,12,0.111111
5,Astrology: I am a Capricorn Sun Cap moon and c...,"I'm a triple Capricorn (Sun, Moon and ascendan...",1,11,0.333333
6,Should I buy tiago?,What keeps childern active and far from phone ...,0,11,0.0
7,How can I be a good geologist?,What should I do to be a great geologist?,1,5,0.333333
8,When do you use シ instead of し?,"When do you use ""&"" instead of ""and""?",0,2,0.6
9,Motorola (company): Can I hack my Charter Moto...,How do I hack Motorola DCX3400 for free internet?,0,8,0.2


In [34]:
#Applying Jaccard Similarity Coefficient to MRP Corpus sentence pairs 
mrp_data['Jaccard_similarity'] = mrp_data.apply(jaccard_sim_coefficient, axis=1)
mrp_data.head(20)

Unnamed: 0,Sentence_1,Sentence_2,is_Paraphrase,Edit_distance,Jaccard_similarity,Sequence_matcher,N-gram_measure
0,"Amrozi accused his brother, whom he called ""th...","Referring to him as only ""the witness"", Amrozi...",1,11,0.666667,0.653659,0.071429
1,Yucaipa owned Dominick's before selling the ch...,Yucaipa bought Dominick's in 1995 for $693 mil...,0,14,0.333333,0.627027,0.0625
2,They had published an advertisement on the Int...,"On June 10, the ship's owners had published an...",1,14,1.0,0.704225,0.0625
3,"Around 0335 GMT, Tab shares were up 19 cents, ...","Tab shares jumped 20 cents, or 4.6%, to set a ...",0,13,0.384615,0.616216,0.076923
4,"The stock rose $2.11, or about 11 percent, to ...",PG&E Corp. shares jumped $1.63 or 8 percent to...,1,15,0.37037,0.605128,0.0625
5,Revenue in the first quarter of the year dropp...,With the scandal hanging over Stewart's compan...,1,8,2.0,0.773109,0.045455
6,"The Nasdaq had a weekly gain of 17.27, or 1.2 ...",The tech-laced Nasdaq Composite .IXIC rallied ...,0,13,0.16,0.505747,0.090909
7,The DVD-CCA then appealed to the state Supreme...,The DVD CCA appealed that decision to the U.S....,1,5,0.666667,0.736842,0.111111
8,"That compared with $35.18 million, or 24 cents...",Earnings were affected by a non-recurring $8 m...,0,10,0.166667,0.544379,0.083333
9,He said the foodservice pie business doesn't f...,The foodservice pie business does not fit our ...,1,6,0.6,0.832298,0.111111


<h4>Sequence Matcher</h4>

*Explain Sequence Matcher*

In [35]:
def sequence_matcher(row):
    return SequenceMatcher(None, row['Sentence_1'], row['Sentence_2']).ratio()

In [36]:
#Applying Sequence Matcher to Quora sentence pairs 
quora_data['Sequence_matcher'] = quora_data.apply(sequence_matcher, axis=1)
quora_data.head(20)

Unnamed: 0,Sentence_1,Sentence_2,is_Paraphrase,Edit_distance,Sequence_matcher
0,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0,3,0.926829
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0,9,0.647482
2,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0,11,0.454545
3,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0,11,0.069565
4,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0,12,0.365217
5,Astrology: I am a Capricorn Sun Cap moon and c...,"I'm a triple Capricorn (Sun, Moon and ascendan...",1,11,0.659091
6,Should I buy tiago?,What keeps childern active and far from phone ...,0,11,0.17284
7,How can I be a good geologist?,What should I do to be a great geologist?,1,5,0.591549
8,When do you use シ instead of し?,"When do you use ""&"" instead of ""and""?",0,2,0.852941
9,Motorola (company): Can I hack my Charter Moto...,How do I hack Motorola DCX3400 for free internet?,0,8,0.495413


In [14]:
#Applying Sequence Matcher to MRP Corpus sentence pairs 
mrp_data['Sequence_matcher'] = mrp_data.apply(sequence_matcher, axis=1)
mrp_data.head(20)

Unnamed: 0,Sentence_1,Sentence_2,is_Paraphrase,Edit_distance,Jaccard_similarity,Sequence_matcher
0,"Amrozi accused his brother, whom he called ""th...","Referring to him as only ""the witness"", Amrozi...",1,11,0.666667,0.653659
1,Yucaipa owned Dominick's before selling the ch...,Yucaipa bought Dominick's in 1995 for $693 mil...,0,14,0.333333,0.627027
2,They had published an advertisement on the Int...,"On June 10, the ship's owners had published an...",1,14,1.0,0.704225
3,"Around 0335 GMT, Tab shares were up 19 cents, ...","Tab shares jumped 20 cents, or 4.6%, to set a ...",0,13,0.384615,0.616216
4,"The stock rose $2.11, or about 11 percent, to ...",PG&E Corp. shares jumped $1.63 or 8 percent to...,1,15,0.37037,0.605128
5,Revenue in the first quarter of the year dropp...,With the scandal hanging over Stewart's compan...,1,8,2.0,0.773109
6,"The Nasdaq had a weekly gain of 17.27, or 1.2 ...",The tech-laced Nasdaq Composite .IXIC rallied ...,0,13,0.16,0.505747
7,The DVD-CCA then appealed to the state Supreme...,The DVD CCA appealed that decision to the U.S....,1,5,0.666667,0.736842
8,"That compared with $35.18 million, or 24 cents...",Earnings were affected by a non-recurring $8 m...,0,10,0.166667,0.544379
9,He said the foodservice pie business doesn't f...,The foodservice pie business does not fit our ...,1,6,0.6,0.832298


<h4>N-gram measure</h4>

*Explain N-gram measure, n=3*

In [37]:
def ngram_measure(row):
    n = 3
    common_count = 1
    grams_sentence_one = ngrams(row['Sentence_1'].split(), n)
    grams_sentence_two = ngrams(row['Sentence_2'].split(), n)
    grams_sentence_two_total = sum(1 for x in grams_sentence_two)
    if grams_sentence_two_total == 0:
        grams_sentence_two_total = 0.001
    for gram_in_one in grams_sentence_one:
        if gram_in_one in grams_sentence_two:
            common_count += 1
    return common_count / grams_sentence_two_total

In [15]:
#Applying N-gram Measure to Quora sentence pairs 
quora_data['N-gram_measure'] = quora_data.apply(ngram_measure, axis=1)
quora_data.head(20)

Unnamed: 0,Sentence_1,Sentence_2,is_Paraphrase,Edit_distance,Jaccard_similarity,Sequence_matcher,N-gram_measure
0,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0,3,1.166667,0.926829,0.1
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0,9,0.3125,0.647482,0.090909
2,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0,11,0.2,0.454545,0.125
3,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0,11,0.0,0.069565,0.142857
4,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0,12,0.111111,0.365217,0.2
5,Astrology: I am a Capricorn Sun Cap moon and c...,"I'm a triple Capricorn (Sun, Moon and ascendan...",1,11,0.333333,0.659091,0.071429
6,Should I buy tiago?,What keeps childern active and far from phone ...,0,11,0.0,0.17284,0.111111
7,How can I be a good geologist?,What should I do to be a great geologist?,1,5,0.333333,0.591549,0.142857
8,When do you use シ instead of し?,"When do you use ""&"" instead of ""and""?",0,2,0.6,0.852941,0.166667
9,Motorola (company): Can I hack my Charter Moto...,How do I hack Motorola DCX3400 for free internet?,0,8,0.2,0.495413,0.142857


In [38]:
#Applying N-gram Measure to MRP Corpus sentence pairs 
mrp_data['N-gram_measure'] = mrp_data.apply(ngram_measure, axis=1)
mrp_data.head(20)

Unnamed: 0,Sentence_1,Sentence_2,is_Paraphrase,Edit_distance,Jaccard_similarity,Sequence_matcher,N-gram_measure
0,"Amrozi accused his brother, whom he called ""th...","Referring to him as only ""the witness"", Amrozi...",1,11,0.666667,0.653659,0.071429
1,Yucaipa owned Dominick's before selling the ch...,Yucaipa bought Dominick's in 1995 for $693 mil...,0,14,0.333333,0.627027,0.0625
2,They had published an advertisement on the Int...,"On June 10, the ship's owners had published an...",1,14,1.0,0.704225,0.0625
3,"Around 0335 GMT, Tab shares were up 19 cents, ...","Tab shares jumped 20 cents, or 4.6%, to set a ...",0,13,0.384615,0.616216,0.076923
4,"The stock rose $2.11, or about 11 percent, to ...",PG&E Corp. shares jumped $1.63 or 8 percent to...,1,15,0.37037,0.605128,0.0625
5,Revenue in the first quarter of the year dropp...,With the scandal hanging over Stewart's compan...,1,8,2.0,0.773109,0.045455
6,"The Nasdaq had a weekly gain of 17.27, or 1.2 ...",The tech-laced Nasdaq Composite .IXIC rallied ...,0,13,0.16,0.505747,0.090909
7,The DVD-CCA then appealed to the state Supreme...,The DVD CCA appealed that decision to the U.S....,1,5,0.666667,0.736842,0.111111
8,"That compared with $35.18 million, or 24 cents...",Earnings were affected by a non-recurring $8 m...,0,10,0.166667,0.544379,0.083333
9,He said the foodservice pie business doesn't f...,The foodservice pie business does not fit our ...,1,6,0.6,0.832298,0.111111


<h4> Pairwise Cosine Similarity </h4>

*Explain*

In [59]:
def get_cosine_similarity(row):
    sentence_one_tokenize = tokenize_negation(row['Sentence_1'])
    sentence_two_tokenize = tokenize_negation(row['Sentence_2'])
    filtered_sentence_one = remove_stop_words(sentence_one_tokenize)
    filtered_sentence_two = remove_stop_words(sentence_two_tokenize)
    #lemmatize to get the root words 
    lemmatize_sentence_one = lemmatize(filtered_sentence_one)
    lemmatize_sentence_two = lemmatize(filtered_sentence_two)
    return calculate_cosine_similarity(lemmatize_sentence_one, lemmatize_sentence_two)
def calculate_cosine_similarity(sentence_one, sentence_two):
    tfidf = TfidfVectorizer(preprocessor=' '.join)
    tfidf_matrix = tfidf.fit_transform([sentence_one, sentence_two])
    similarity = cosine_similarity(tfidf_matrix)[0,1]
    return similarity

In [60]:
#Applying Cosine Similaritye to Quora sentence pairs 
quora_data['Cosine_Similarity'] = quora_data.apply(get_cosine_similarity, axis=1)
quora_data.head(20)

Unnamed: 0,Sentence_1,Sentence_2,is_Paraphrase,Edit_distance,Sequence_matcher,Cosine_Similarity
0,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0,3,0.926829,0.895532
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0,9,0.647482,0.410995
2,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0,11,0.454545,0.225765
3,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0,11,0.069565,0.0
4,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0,12,0.365217,0.168368
5,Astrology: I am a Capricorn Sun Cap moon and c...,"I'm a triple Capricorn (Sun, Moon and ascendan...",1,11,0.659091,0.400001
6,Should I buy tiago?,What keeps childern active and far from phone ...,0,11,0.17284,0.0
7,How can I be a good geologist?,What should I do to be a great geologist?,1,5,0.591549,0.336097
8,When do you use シ instead of し?,"When do you use ""&"" instead of ""and""?",0,2,0.852941,0.709297
9,Motorola (company): Can I hack my Charter Moto...,How do I hack Motorola DCX3400 for free internet?,0,8,0.495413,0.380873


In [61]:
#Applying Cosine Similarity to MRP Corpus sentence pairs 
mrp_data['Cosine_Similarity'] = mrp_data.apply(get_cosine_similarity, axis=1)
mrp_data.head(20)

Unnamed: 0,Sentence_1,Sentence_2,is_Paraphrase,Edit_distance,Jaccard_similarity,Sequence_matcher,N-gram_measure,Cosine_Similarity
0,"Amrozi accused his brother, whom he called ""th...","Referring to him as only ""the witness"", Amrozi...",1,11,0.666667,0.653659,0.071429,0.801978
1,Yucaipa owned Dominick's before selling the ch...,Yucaipa bought Dominick's in 1995 for $693 mil...,0,14,0.333333,0.627027,0.0625,0.392181
2,They had published an advertisement on the Int...,"On June 10, the ship's owners had published an...",1,14,1.0,0.704225,0.0625,0.588364
3,"Around 0335 GMT, Tab shares were up 19 cents, ...","Tab shares jumped 20 cents, or 4.6%, to set a ...",0,13,0.384615,0.616216,0.076923,0.448422
4,"The stock rose $2.11, or about 11 percent, to ...",PG&E Corp. shares jumped $1.63 or 8 percent to...,1,15,0.37037,0.605128,0.0625,0.377462
5,Revenue in the first quarter of the year dropp...,With the scandal hanging over Stewart's compan...,1,8,2.0,0.773109,0.045455,0.776515
6,"The Nasdaq had a weekly gain of 17.27, or 1.2 ...",The tech-laced Nasdaq Composite .IXIC rallied ...,0,13,0.16,0.505747,0.090909,0.215283
7,The DVD-CCA then appealed to the state Supreme...,The DVD CCA appealed that decision to the U.S....,1,5,0.666667,0.736842,0.111111,0.716812
8,"That compared with $35.18 million, or 24 cents...",Earnings were affected by a non-recurring $8 m...,0,10,0.166667,0.544379,0.083333,0.237944
9,He said the foodservice pie business doesn't f...,The foodservice pie business does not fit our ...,1,6,0.6,0.832298,0.111111,0.81818


<h3>Semantic Similarity</h3>

<h4> Word Mover's Distance </h4>


*Explain WMD here*

In [62]:
def word_movers_distance(row):
    sentence_one_tokenize = tokenize(row['Sentence_1'])
    sentence_two_tokenize = tokenize(row['Sentence_2'])
    filtered_sentence_one = remove_stop_words(sentence_one_tokenize)
    filtered_sentence_two = remove_stop_words(sentence_two_tokenize)
    distance = word_vectors.wmdistance(filtered_sentence_one, filtered_sentence_two)
    return distance

In [63]:
#Applying Word Mover's Distance to Quora sentence pairs 
quora_data['WMD_distance'] = quora_data.apply(word_movers_distance, axis=1)
quora_data.head(20)

Unnamed: 0,Sentence_1,Sentence_2,is_Paraphrase,Edit_distance,Sequence_matcher,Cosine_Similarity,WMD_distance
0,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0,3,0.926829,0.895532,1.014917
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0,9,0.647482,0.410995,5.469145
2,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0,11,0.454545,0.225765,2.738648
3,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0,11,0.069565,0.0,6.045803
4,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0,12,0.365217,0.168368,6.361994
5,Astrology: I am a Capricorn Sun Cap moon and c...,"I'm a triple Capricorn (Sun, Moon and ascendan...",1,11,0.659091,0.400001,3.305969
6,Should I buy tiago?,What keeps childern active and far from phone ...,0,11,0.17284,0.0,6.220479
7,How can I be a good geologist?,What should I do to be a great geologist?,1,5,0.591549,0.336097,3.960037
8,When do you use シ instead of し?,"When do you use ""&"" instead of ""and""?",0,2,0.852941,0.709297,0.0
9,Motorola (company): Can I hack my Charter Moto...,How do I hack Motorola DCX3400 for free internet?,0,8,0.495413,0.380873,2.001968


In [22]:
#Applying Word Mover's Distance to MRP Corpus sentence pairs 
mrp_data['WMD_distance'] = mrp_data.apply(word_movers_distance, axis=1)
mrp_data.head(20)

Unnamed: 0,Sentence_1,Sentence_2,is_Paraphrase,Edit_distance,Jaccard_similarity,Sequence_matcher,N-gram_measure,Cosine_Similarity,WMD_distance
0,"Amrozi accused his brother, whom he called ""th...","Referring to him as only ""the witness"", Amrozi...",1,11,0.666667,0.653659,0.071429,0.801978,1.671967
1,Yucaipa owned Dominick's before selling the ch...,Yucaipa bought Dominick's in 1995 for $693 mil...,0,14,0.333333,0.627027,0.0625,0.392181,3.627347
2,They had published an advertisement on the Int...,"On June 10, the ship's owners had published an...",1,14,1.0,0.704225,0.0625,0.588364,2.164189
3,"Around 0335 GMT, Tab shares were up 19 cents, ...","Tab shares jumped 20 cents, or 4.6%, to set a ...",0,13,0.384615,0.616216,0.076923,0.448422,2.167538
4,"The stock rose $2.11, or about 11 percent, to ...",PG&E Corp. shares jumped $1.63 or 8 percent to...,1,15,0.37037,0.605128,0.0625,0.377462,3.536683
5,Revenue in the first quarter of the year dropp...,With the scandal hanging over Stewart's compan...,1,8,2.0,0.773109,0.045455,0.776515,1.271385
6,"The Nasdaq had a weekly gain of 17.27, or 1.2 ...",The tech-laced Nasdaq Composite .IXIC rallied ...,0,13,0.16,0.505747,0.090909,0.215283,4.911093
7,The DVD-CCA then appealed to the state Supreme...,The DVD CCA appealed that decision to the U.S....,1,5,0.666667,0.736842,0.111111,0.716812,4.094435
8,"That compared with $35.18 million, or 24 cents...",Earnings were affected by a non-recurring $8 m...,0,10,0.166667,0.544379,0.083333,0.237944,6.319602
9,He said the foodservice pie business doesn't f...,The foodservice pie business does not fit our ...,1,6,0.6,0.832298,0.111111,0.81818,1.05798


<h4>Named Entity Recognition Similarity</h4>

*Explain NER with jaccard*

In [64]:
def ner_measure(row):
    ner_sentence_one=[]
    ner_sentence_two=[]
    count_common_ner = 0
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(row['Sentence_1']))):
        if hasattr(chunk, 'label'):
            ner_sentence_one.append(chunk)
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(row['Sentence_2']))):
        if hasattr(chunk, 'label'):
            ner_sentence_two.append(chunk)
    for item in ner_sentence_one:
        if item in ner_sentence_two:
            count_common_ner += 1
    return count_common_ner / (len(ner_sentence_one) + len(ner_sentence_two) + 0.001 - count_common_ner)

In [66]:
#Applying NER Measure to Quora sentence pairs 
quora_data['NER_Similarity'] = quora_data.apply(ner_measure, axis=1)
quora_data.head(20)

Unnamed: 0,Sentence_1,Sentence_2,is_Paraphrase,Edit_distance,Sequence_matcher,Cosine_Similarity,WMD_distance,NER_Similarity
0,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0,3,0.926829,0.895532,1.014917,0.0
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0,9,0.647482,0.410995,5.469145,0.333222
2,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0,11,0.454545,0.225765,2.738648,0.0
3,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0,11,0.069565,0.0,6.045803,0.0
4,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0,12,0.365217,0.168368,6.361994,0.0
5,Astrology: I am a Capricorn Sun Cap moon and c...,"I'm a triple Capricorn (Sun, Moon and ascendan...",1,11,0.659091,0.400001,3.305969,0.0
6,Should I buy tiago?,What keeps childern active and far from phone ...,0,11,0.17284,0.0,6.220479,0.0
7,How can I be a good geologist?,What should I do to be a great geologist?,1,5,0.591549,0.336097,3.960037,0.0
8,When do you use シ instead of し?,"When do you use ""&"" instead of ""and""?",0,2,0.852941,0.709297,0.0,0.0
9,Motorola (company): Can I hack my Charter Moto...,How do I hack Motorola DCX3400 for free internet?,0,8,0.495413,0.380873,2.001968,0.0


In [67]:
#Applying NER Measure to MRP Corpus sentence pairs 
mrp_data['NER_similarity'] = mrp_data.apply(ner_measure, axis=1)
mrp_data.head(20)

Unnamed: 0,Sentence_1,Sentence_2,is_Paraphrase,Edit_distance,Jaccard_similarity,Sequence_matcher,N-gram_measure,Cosine_Similarity,NER_similarity
0,"Amrozi accused his brother, whom he called ""th...","Referring to him as only ""the witness"", Amrozi...",1,11,0.666667,0.653659,0.071429,0.801978,0.999001
1,Yucaipa owned Dominick's before selling the ch...,Yucaipa bought Dominick's in 1995 for $693 mil...,0,14,0.333333,0.627027,0.0625,0.392181,0.499875
2,They had published an advertisement on the Int...,"On June 10, the ship's owners had published an...",1,14,1.0,0.704225,0.0625,0.588364,0.0
3,"Around 0335 GMT, Tab shares were up 19 cents, ...","Tab shares jumped 20 cents, or 4.6%, to set a ...",0,13,0.384615,0.616216,0.076923,0.448422,0.0
4,"The stock rose $2.11, or about 11 percent, to ...",PG&E Corp. shares jumped $1.63 or 8 percent to...,1,15,0.37037,0.605128,0.0625,0.377462,0.999001
5,Revenue in the first quarter of the year dropp...,With the scandal hanging over Stewart's compan...,1,8,2.0,0.773109,0.045455,0.776515,0.0
6,"The Nasdaq had a weekly gain of 17.27, or 1.2 ...",The tech-laced Nasdaq Composite .IXIC rallied ...,0,13,0.16,0.505747,0.090909,0.215283,0.0
7,The DVD-CCA then appealed to the state Supreme...,The DVD CCA appealed that decision to the U.S....,1,5,0.666667,0.736842,0.111111,0.716812,0.333222
8,"That compared with $35.18 million, or 24 cents...",Earnings were affected by a non-recurring $8 m...,0,10,0.166667,0.544379,0.083333,0.237944,0.0
9,He said the foodservice pie business doesn't f...,The foodservice pie business does not fit our ...,1,6,0.6,0.832298,0.111111,0.81818,0.0


<h4>Word Sense Disambiguation</h4>

*Explain WSD using Lesk algorithm*

In [68]:
def wsd(row):
    sentence_one_senses = []
    sentence_two_senses = []
    common_senses = 0
    sentence_one_tokenize = tokenize(row['Sentence_1'])
    sentence_two_tokenize = tokenize(row['Sentence_2'])
    for word in sentence_one_tokenize:
        sentence_one_senses.append(lesk(row['Sentence_1'], word))
    for word in sentence_two_tokenize:
        sentence_two_senses.append(lesk(['Sentenece_2'], word))
    sentence_one_senses = (set(sentence_one_senses))
    sentence_two_senses = (set(sentence_two_senses))

    for sense in sentence_one_senses:
        if sense in sentence_two_senses:
            common_senses += 1
    return common_senses / (len(sentence_one_senses) + len(sentence_two_senses) - common_senses)

In [69]:
#Applying WSD Measure to Quora sentence pairs 
quora_data['WSD'] = quora_data.apply(wsd, axis=1)
quora_data.head(20)

Unnamed: 0,Sentence_1,Sentence_2,is_Paraphrase,Edit_distance,Sequence_matcher,Cosine_Similarity,WMD_distance,NER_Similarity,WSD
0,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0,3,0.926829,0.895532,1.014917,0.0,0.416667
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0,9,0.647482,0.410995,5.469145,0.333222,0.125
2,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0,11,0.454545,0.225765,2.738648,0.0,0.1875
3,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0,11,0.069565,0.0,6.045803,0.0,0.076923
4,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0,12,0.365217,0.168368,6.361994,0.0,0.2
5,Astrology: I am a Capricorn Sun Cap moon and c...,"I'm a triple Capricorn (Sun, Moon and ascendan...",1,11,0.659091,0.400001,3.305969,0.0,0.166667
6,Should I buy tiago?,What keeps childern active and far from phone ...,0,11,0.17284,0.0,6.220479,0.0,0.125
7,How can I be a good geologist?,What should I do to be a great geologist?,1,5,0.591549,0.336097,3.960037,0.0,0.2
8,When do you use シ instead of し?,"When do you use ""&"" instead of ""and""?",0,2,0.852941,0.709297,0.0,0.0,0.333333
9,Motorola (company): Can I hack my Charter Moto...,How do I hack Motorola DCX3400 for free internet?,0,8,0.495413,0.380873,2.001968,0.0,0.428571


In [25]:
#Applying WSD Measure to MRP Corpus sentence pairs 
mrp_data['WSD'] = mrp_data.apply(wsd, axis=1)
mrp_data.head(20)

Unnamed: 0,Sentence_1,Sentence_2,is_Paraphrase,Edit_distance,Jaccard_similarity,Sequence_matcher,N-gram_measure,Cosine_Similarity,WMD_distance,WSD
0,"Amrozi accused his brother, whom he called ""th...","Referring to him as only ""the witness"", Amrozi...",1,11,0.666667,0.653659,0.071429,0.801978,1.671967,0.4
1,Yucaipa owned Dominick's before selling the ch...,Yucaipa bought Dominick's in 1995 for $693 mil...,0,14,0.333333,0.627027,0.0625,0.392181,3.627347,0.181818
2,They had published an advertisement on the Int...,"On June 10, the ship's owners had published an...",1,14,1.0,0.704225,0.0625,0.588364,2.164189,0.4
3,"Around 0335 GMT, Tab shares were up 19 cents, ...","Tab shares jumped 20 cents, or 4.6%, to set a ...",0,13,0.384615,0.616216,0.076923,0.448422,2.167538,0.173913
4,"The stock rose $2.11, or about 11 percent, to ...",PG&E Corp. shares jumped $1.63 or 8 percent to...,1,15,0.37037,0.605128,0.0625,0.377462,3.536683,0.277778
5,Revenue in the first quarter of the year dropp...,With the scandal hanging over Stewart's compan...,1,8,2.0,0.773109,0.045455,0.776515,1.271385,0.3
6,"The Nasdaq had a weekly gain of 17.27, or 1.2 ...",The tech-laced Nasdaq Composite .IXIC rallied ...,0,13,0.16,0.505747,0.090909,0.215283,4.911093,0.25
7,The DVD-CCA then appealed to the state Supreme...,The DVD CCA appealed that decision to the U.S....,1,5,0.666667,0.736842,0.111111,0.716812,4.094435,0.222222
8,"That compared with $35.18 million, or 24 cents...",Earnings were affected by a non-recurring $8 m...,0,10,0.166667,0.544379,0.083333,0.237944,6.319602,0.142857
9,He said the foodservice pie business doesn't f...,The foodservice pie business does not fit our ...,1,6,0.6,0.832298,0.111111,0.81818,1.05798,0.333333


<h4> Synonym Extended Cosine Similarity </h4>
Here, we will use WordNet's capabilities to extend our dictionary. We will then call our existing calculate_cosine_similarity method to get the similarity of the extended documents. 

In [71]:
def extend_sentence_synonym(row):
    sentence_one_tokenize = tokenize_negation(row['Sentence_1'])
    sentence_two_tokenize = tokenize_negation(row['Sentence_2'])
    filtered_sentence_one = remove_stop_words(sentence_one_tokenize)
    filtered_sentence_two = remove_stop_words(sentence_two_tokenize)
    #lemmatize to get the root words 
    lemmatize_sentence_one = lemmatize(filtered_sentence_one)
    lemmatize_sentence_two = lemmatize(filtered_sentence_two)
    # get extended synonym list    
    extended_dictionary_one = []
    extended_dictionary_two = []
    for one, two in zip(lemmatize_sentence_one, lemmatize_sentence_two):
        synonym_one = add_synonym(one)
        synonym_two = add_synonym(two)
        if(synonym_one):
            extended_dictionary_one += synonym_one
        if(synonym_two):
            extended_dictionary_two += synonym_two
    lemmatize_sentence_one += extended_dictionary_one
    lemmatize_sentence_two += extended_dictionary_two
    
    #calculate similarity based on the extended list 
    similarity = calculate_cosine_similarity(lemmatize_sentence_one, lemmatize_sentence_two)
    return similarity

In [72]:
#Applying Synonym_Cosine to Quora sentence pairs 
quora_data['Synonym_Cosine'] = quora_data.apply(extend_sentence_synonym, axis=1)
quora_data.head(20)

Unnamed: 0,Sentence_1,Sentence_2,is_Paraphrase,Edit_distance,Sequence_matcher,Cosine_Similarity,WMD_distance,NER_Similarity,WSD,Synonym_Cosine
0,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0,3,0.926829,0.895532,1.014917,0.0,0.416667,0.939115
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0,9,0.647482,0.410995,5.469145,0.333222,0.125,0.0439
2,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0,11,0.454545,0.225765,2.738648,0.0,0.1875,0.640131
3,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0,11,0.069565,0.0,6.045803,0.0,0.076923,0.055971
4,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0,12,0.365217,0.168368,6.361994,0.0,0.2,0.08218
5,Astrology: I am a Capricorn Sun Cap moon and c...,"I'm a triple Capricorn (Sun, Moon and ascendan...",1,11,0.659091,0.400001,3.305969,0.0,0.166667,0.350542
6,Should I buy tiago?,What keeps childern active and far from phone ...,0,11,0.17284,0.0,6.220479,0.0,0.125,0.0
7,How can I be a good geologist?,What should I do to be a great geologist?,1,5,0.591549,0.336097,3.960037,0.0,0.2,0.012248
8,When do you use シ instead of し?,"When do you use ""&"" instead of ""and""?",0,2,0.852941,0.709297,0.0,0.0,0.333333,0.989762
9,Motorola (company): Can I hack my Charter Moto...,How do I hack Motorola DCX3400 for free internet?,0,8,0.495413,0.380873,2.001968,0.0,0.428571,0.566105


In [73]:
mrp_data['Synonym_Cosine'] = mrp_data.apply(extend_sentence_synonym, axis = 1)
mrp_data.head(20)

Unnamed: 0,Sentence_1,Sentence_2,is_Paraphrase,Edit_distance,Jaccard_similarity,Sequence_matcher,N-gram_measure,Cosine_Similarity,NER_similarity,Synonym_Cosine
0,"Amrozi accused his brother, whom he called ""th...","Referring to him as only ""the witness"", Amrozi...",1,11,0.666667,0.653659,0.071429,0.801978,0.999001,0.392325
1,Yucaipa owned Dominick's before selling the ch...,Yucaipa bought Dominick's in 1995 for $693 mil...,0,14,0.333333,0.627027,0.0625,0.392181,0.499875,0.089886
2,They had published an advertisement on the Int...,"On June 10, the ship's owners had published an...",1,14,1.0,0.704225,0.0625,0.588364,0.0,0.693234
3,"Around 0335 GMT, Tab shares were up 19 cents, ...","Tab shares jumped 20 cents, or 4.6%, to set a ...",0,13,0.384615,0.616216,0.076923,0.448422,0.0,0.487594
4,"The stock rose $2.11, or about 11 percent, to ...",PG&E Corp. shares jumped $1.63 or 8 percent to...,1,15,0.37037,0.605128,0.0625,0.377462,0.999001,0.473946
5,Revenue in the first quarter of the year dropp...,With the scandal hanging over Stewart's compan...,1,8,2.0,0.773109,0.045455,0.776515,0.0,0.829344
6,"The Nasdaq had a weekly gain of 17.27, or 1.2 ...",The tech-laced Nasdaq Composite .IXIC rallied ...,0,13,0.16,0.505747,0.090909,0.215283,0.0,0.167325
7,The DVD-CCA then appealed to the state Supreme...,The DVD CCA appealed that decision to the U.S....,1,5,0.666667,0.736842,0.111111,0.716812,0.333222,0.191799
8,"That compared with $35.18 million, or 24 cents...",Earnings were affected by a non-recurring $8 m...,0,10,0.166667,0.544379,0.083333,0.237944,0.0,0.045961
9,He said the foodservice pie business doesn't f...,The foodservice pie business does not fit our ...,1,6,0.6,0.832298,0.111111,0.81818,0.0,0.831294


<h4> Hypernym Extended Similarity </h4>
Checking if words are near by in hypernym hierarchy  <br/>
Wordnet hierarchy augmented with probabilities  <br />
Then calculate shortest path based on similarity


<h2> Train Set & Test Set </h2>

For training with less bias, we want to maintain a reasonably equal class distribution. We want to make sure that we have a balanced class of non-paraphrase, as well as paraphrase pairs of sentences. If this is not possible, as with most real-life scenarios, we need to think of a way to adjust our metric for evaluation in order to accomodate for this imbalance. 

In [None]:
# print("Mrp Data")
# print(mrp_data['is_Paraphrase'].value_counts())
print("Quora Data")
print(quora_data['is_Paraphrase'].value_counts())

<h2> Model </h2>

<h3> Baseline Model </h3>

Firstly, instead of randomly picking a threshold for the baseline method, we are going to 'learn' a threshold that yields best results for us in terms of accuracy, recall, and precision. 

In [170]:
def learn_threshold(row, **kwargs):
    if(row['Cosine_Similarity'] > kwargs['threshold']):
        classification = 1
    else:
        classification = 0
    return classification

thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
quora_data_thresholds = {}
mrp_data_thresholds = {}
accuracies = {}
recalls = {}
precisions = {}
for threshold in thresholds:
    quora_data_thresholds[threshold] = quora_data.apply(learn_threshold, threshold = threshold, axis = 1)
    accuracies[threshold] = calculate_accuracy(quora_data_thresholds[threshold], quora_data['is_Paraphrase'])
    recalls[threshold] = calculate_recall(quora_data_thresholds[threshold], quora_data['is_Paraphrase'], 1)
    precisions[threshold] = calculate_precision(quora_data_thresholds[threshold], quora_data['is_Paraphrase'], 1)
print("Quora Accuracy: {}, \n Quora Recall: {},\nQuora Precision: {}" .format(accuracies, recalls, precisions))

for threshold in thresholds:
    mrp_data_thresholds[threshold] = mrp_data.apply(learn_threshold, threshold = threshold, axis = 1)
    accuracies[threshold] = calculate_accuracy(mrp_data_thresholds[threshold], mrp_data['is_Paraphrase'])
    recalls[threshold] = calculate_recall(mrp_data_thresholds[threshold], mrp_data['is_Paraphrase'],1)
    precisions[threshold] = calculate_precision(mrp_data_thresholds[threshold], mrp_data['is_Paraphrase'], 1)
print("MRP Accuracy: {}, \nMRP Recall: {},\nMRP Precision: {}" .format(accuracies, recalls, precisions))

Quora Accuracy: {0.1: '50.0%', 0.2: '60.2%', 0.3: '66.4%', 0.4: '67.5%', 0.5: '67.5%', 0.6: '65.8%', 0.7: '65.7%', 0.8: '65.2%', 0.9: '65.3%'}, 
 Quora Recall: {0.1: '43.1%', 0.2: '48.7%', 0.3: '53.5%', 0.4: '55.4%', 0.5: '56.8%', 0.6: '56.9%', 0.7: '59.5%', 0.8: '65.5%', 0.9: '76.1%'},
Quora Precision: {0.1: '100.0%', 0.2: '98.5%', 0.3: '86.5%', 0.4: '72.0%', 0.5: '58.6%', 0.6: '39.7%', 0.7: '29.1%', 0.8: '16.7%', 0.9: '12.0%'}
MRP Accuracy: {0.1: '67.4%', 0.2: '68.8%', 0.3: '70.0%', 0.4: '70.9%', 0.5: '66.5%', 0.6: '58.3%', 0.7: '48.1%', 0.8: '39.0%', 0.9: '34.2%'}, 
MRP Recall: {0.1: '67.6%', 0.2: '68.7%', 0.3: '71.1%', 0.4: '76.7%', 0.5: '82.0%', 0.6: '88.0%', 0.7: '91.6%', 0.8: '96.1%', 0.9: '98.6%'},
MRP Precision: {0.1: '99.4%', 0.2: '98.8%', 0.3: '93.6%', 0.4: '81.7%', 0.5: '64.7%', 0.6: '44.3%', 0.7: '25.6%', 0.8: '10.2%', 0.9: '2.7%'}


In [74]:
def apply_cosine_classification(row):
    if (row['Cosine_Similarity'] > 0.5):
        classification = 1
    else:
        classification = 0
    return classification

In [75]:
quora_baseline = {}
quora_baseline['Cosine_Classification'] = quora_data.apply(apply_cosine_classification, axis=1)
print(quora_baseline)

{'Cosine_Classification': 0       1
1       0
2       0
3       0
4       0
       ..
3995    0
3996    0
3997    0
3998    1
3999    1
Length: 4000, dtype: int64}


In [76]:
baseline = {}
baseline['Cosine_Classification'] = mrp_data.apply(apply_cosine_classification, axis=1)
print(baseline)

{'Cosine_Classification': 0       1
1       0
2       1
3       0
4       0
       ..
3944    0
3945    0
3946    1
3947    1
3948    1
Length: 3949, dtype: int64}


<h3> Multi-layer Perceptron </h3>

In [32]:
properties = list(mrp_data.columns.values)
properties.remove('is_Paraphrase')
properties.remove('Sentence_1')
properties.remove('Sentence_2')
print(properties)
X = mrp_data[properties]
y = mrp_data['is_Paraphrase']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(8,)),
    keras.layers.Dense(7, activation=tf.nn.relu),
    keras.layers.Dense(7, activation=tf.nn.relu),
    keras.layers.Dense(1, activation=tf.nn.sigmoid),
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy', keras.metrics.Precision(), keras.metrics.Recall()])

model.fit(X_train, y_train, epochs=10, batch_size=1)

test_loss, test_acc, test_pre, test_recall = model.evaluate(X_test, y_test)
print('Test accuracy:{}, test recall: {}, test precision: {}'.format(test_acc, test_recall, test_pre ))

['Edit_distance', 'Jaccard_similarity', 'Sequence_matcher', 'N-gram_measure', 'Cosine_Similarity', 'WMD_distance', 'WSD', 'Synonym_Cosine']
Train on 2764 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test accuracy:0.7164556980133057, test recall: 0.7782002687454224, test precision: 0.792258083820343


In [33]:
properties = list(quora_data.columns.values)
properties.remove('is_Paraphrase')
properties.remove('Sentence_1')
properties.remove('Sentence_2')
X = quora_data[properties]
y = quora_data['is_Paraphrase']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

quora_model = keras.Sequential([
    keras.layers.Flatten(input_shape=(6,)),
    keras.layers.Dense(7, activation=tf.nn.relu),
    keras.layers.Dense(7, activation=tf.nn.relu),
    keras.layers.Dense(1, activation=tf.nn.sigmoid),
])

quora_model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy', keras.metrics.Precision(), keras.metrics.Recall()])

quora_model.fit(X_train, y_train, epochs=10, batch_size=1)

test_loss, test_acc, test_pre, test_recall = quora_model.evaluate(X_test, y_test)
print('Test accuracy:{}, test recall: {}, test precision: {}'.format(test_acc, test_recall, test_pre ))

Train on 2800 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test accuracy:0.6191666722297668, test recall: 0.0, test precision: 0.0


<h2> Results & Evaluation </h2>

In [77]:
def calculate_accuracy(model, actual_tags):
    correct = 0
    total = 0
    for prediction, actual in zip(model, actual_tags):
        total += 1
        if prediction == actual:
            correct += 1 
    accuracy = correct / total 
    return '{0:.1%}'.format(accuracy)
def build_confusion_matrix(actual_tags, model, classOfInterest):
    confusion_matrix = {}
    truePositives = len([p for p, a in zip(model, actual_tags) if p == a and p == classOfInterest])
    trueNegatives = len([p for p, a in zip(model, actual_tags) if p == a and p != classOfInterest])
    falsePositives = len([p for p, a in zip(model, actual_tags) if p != a and p == classOfInterest])
    falseNegatives = len([p for p, a in zip(model, actual_tags) if p != a and p != classOfInterest])
    confusion_matrix["tp"] = truePositives
    confusion_matrix["tn"] = trueNegatives
    confusion_matrix["fp"] = falsePositives 
    confusion_matrix["fn"] = falseNegatives
    return confusion_matrix
def calculate_recall(model, actual_tags, classOfInterest):
    matrix = build_confusion_matrix(model, actual_tags, classOfInterest)
    recall = matrix["tp"] / ( matrix["tp"] + matrix ["fn"])
    return '{0:.1%}'.format(recall)
def calculate_precision(model, actual_tags, classOfInterest):
    matrix = build_confusion_matrix(model, actual_tags, classOfInterest)
    precision = matrix["tp"]/ (matrix["tp"] + matrix["fp"])
    return '{0:.1%}'.format(precision)

In [78]:
quora_baseline_accuracy = calculate_accuracy(quora_baseline['Cosine_Classification'], quora_data['is_Paraphrase'])
quora_baseline_recall = calculate_recall(quora_baseline['Cosine_Classification'], quora_data['is_Paraphrase'], 1)
quora_baseline_precision = calculate_precision(quora_baseline['Cosine_Classification'], quora_data['is_Paraphrase'], 1)

In [79]:
baseline_accuracy = calculate_accuracy(baseline['Cosine_Classification'], mrp_data['is_Paraphrase'])
baseline_recall = calculate_recall(baseline['Cosine_Classification'], mrp_data['is_Paraphrase'], 1)
baseline_precision = calculate_precision(baseline['Cosine_Classification'], mrp_data['is_Paraphrase'], 1)

In [80]:
baseline_evaluation_summary = {"Model": ['Baseline'],
                   "Accuracy":[(baseline_accuracy)], 
                   "Recall":[baseline_recall], 
                   "Precision":[baseline_precision]}
results_df = pd.DataFrame(baseline_evaluation_summary)
print(results_df)

      Model Accuracy Recall Precision
0  Baseline    66.5%  82.0%     64.7%


In [81]:
quora_baseline_evaluation_summary = {"Model": ['Quora'],
                   "Accuracy":[(quora_baseline_accuracy)], 
                   "Recall":[quora_baseline_recall], 
                   "Precision":[quora_baseline_precision]}
quora_results_df = pd.DataFrame(quora_baseline_evaluation_summary)
print(quora_results_df)

   Model Accuracy Recall Precision
0  Quora    67.5%  56.8%     58.6%


<h2> References </h2>