# Sentence Similarity Measures III: Wide-Inclusive Sentence Featurization

## 0. Contents

* I. Corpora:
    * MSR Paraphrase Corpus
    * Brown
* II. Discriminativity Weighting (Brown, SpaCy lemmatization)
* III. Featurization:
    * Features:
        * Unigram Prec/Rec (Wan et al. 2006) 
        * Bleu Prec/Rec (Papineni et al. 2002)
        * Dependency Prec/Rec (Wan et al. 2006; Moll$\acute{a}$ 2003; Hovy et al. 2015)
        * F1 for Unigram, Bleu & Dependency
        * Tree Edit Distance (Zhang & Sasha Algorithm)
        * Sentence Lengths (Wan et al. 2006)
    * Featurization Function
* IV. Paraphrase Classifier:
    * Training: MSR Paraphrase Corpus
    * Classifier Types:
        * Logistic
        * SVM
* V. Evaluation

## I. Corpora

In [15]:
import numpy as np
import pandas as pd
from nltk.corpus import brown
from spacy.en import English

In [16]:
parser = English()

##### Load MSR

In [17]:
path = "/Users/jacobsw/Desktop/WORK/OJO_CODE/SENTENCE_SIMILARITIES/CORPORA/paraphrase/msr_paraphrase_data.txt"

In [18]:
df = pd.read_csv(path, delimiter='\t')
df.head()

Unnamed: 0,﻿Sentence ID,String,Author,URL,Agency,Date,Web Date
0,702876,"Amrozi accused his brother, whom he called ""th...",Darren Goodsir,www.theage.com.au,*,June 5 2003,2003/06/04
1,702977,"Referring to him as only ""the witness"", Amrozi...",Darren Goodsir,www.smh.com.au,Sydney Morning Herald,June 5 2003,2003/06/04
2,2108705,Yucaipa owned Dominick's before selling the ch...,MICHAEL GIBBS,www.nwherald.com,*,*,2003/08/23
3,2108831,Yucaipa bought Dominick's in 1995 for $693 mil...,ALEX VEIGA,www.miami.com,*,*,2003/08/23
4,1330381,They had published an advertisement on the Int...,Philip Pangalos,www.alertnet.org,*,*,2003/06/25


In [19]:
len(df['String'])

10594

In [20]:
data = list(df['String'])

In [21]:
for i in xrange(5):
    print data[i]
    print

Amrozi accused his brother, whom he called "the witness", of deliberately distorting his evidence.

Referring to him as only "the witness", Amrozi accused his brother of deliberately distorting his evidence.

Yucaipa owned Dominick's before selling the chain to Safeway in 1998 for $2.5 billion.

Yucaipa bought Dominick's in 1995 for $693 million and sold it to Safeway for $1.8 billion in 1998.

They had published an advertisement on the Internet on June 10, offering the cargo for sale, he added.



In [22]:
def parse_msr():
    
    parsed_sents = [parser(unicode(sent.decode('utf8','ignore'))) for sent in data]
    lemma_sents = [[token.lemma_ for token in parsed_sent] 
                   for parsed_sent in parsed_sents]
    
    return lemma_sents

In [23]:
%%time
msr_sents = parse_msr()

CPU times: user 20.1 s, sys: 501 ms, total: 20.6 s
Wall time: 20.7 s


In [24]:
for i in xrange(5):
    print msr_sents[i]
    print

[u'amrozi', u'accuse', u'his', u'brother', u',', u'whom', u'he', u'call', u'"', u'the', u'witness', u'"', u',', u'of', u'deliberately', u'distort', u'his', u'evidence', u'.']

[u'refer', u'to', u'him', u'as', u'only', u'"', u'the', u'witness', u'"', u',', u'amrozi', u'accuse', u'his', u'brother', u'of', u'deliberately', u'distort', u'his', u'evidence', u'.']

[u'yucaipa', u'own', u'dominick', u"'s", u'before', u'sell', u'the', u'chain', u'to', u'safeway', u'in', u'1998', u'for', u'$', u'2.5', u'billion', u'.']

[u'yucaipa', u'buy', u'dominick', u"'s", u'in', u'1995', u'for', u'$', u'693', u'million', u'and', u'sell', u'it', u'to', u'safeway', u'for', u'$', u'1.8', u'billion', u'in', u'1998', u'.']

[u'they', u'have', u'publish', u'an', u'advertisement', u'on', u'the', u'internet', u'on', u'june', u'10', u',', u'offer', u'the', u'cargo', u'for', u'sale', u',', u'he', u'add', u'.']



##### Build MSR Training

In [25]:
# TODO

##### Load Brown

In [3]:
def parse_brown():
    
    sents = brown.sents()
    parsed_sents = [parser(' '.join(sent)) for sent in sents]
    lemma_words = [token.lemma_ for parsed_sent in parsed_sents for token in parsed_sent]
    
    return lemma_words

In [4]:
%%time
brown_words = parse_brown()

CPU times: user 1min 38s, sys: 770 ms, total: 1min 39s
Wall time: 1min 39s


In [5]:
N = len(brown_words)
N

1188973

## I. Discriminativity Weighting (IDF)

**Math**

* $IDF(w) = log\frac{N}{df_w}$, where $N$ is the number of words in a corpus; $df_w$ is word $w$'s frequency in the corpus.

In [90]:
from __future__ import division

In [91]:
log = lambda x: np.log(x) if x>0 else 0 
    # intuitively N > word_count(w) for any w,
    #  therefore we cannot let idf(w) be negative
    #  even when word_count(w) = 0 for w.
div = lambda x,y: x/y if y!=0 else 0

In [92]:
def idf(w):
    
    return log(div(N,brown_words.count(w)))

In [93]:
print "'the': ", idf('the')
print "'discriminate': ", idf('discriminate')

'the':  2.83230709
'discriminate':  12.0426903182


## IIa. Features

### A.  Unigram Prec/Rec

**Math**

* $Uni\_Prec(s_1,s_2) = \frac{word\_overlap(s_1,s_2)\cdot \left(\sum_{w\in s_1\cap s_2}log\frac{N}{df_w}\right)}{word\_count(s_1)}$ (cf. Wan et al. 2006:133, weighted by $IDF$)


* $Uni\_Rec(s_1,s_2) = \frac{word\_overlap(s_1,s_2)\cdot \left(\sum_{w\in s_1\cap s_2}log\frac{N}{df_w}\right)}{word\_count(s_2)}$ (cf. ibid.)

In [42]:
intersection = lambda s1,s2: set(s1).intersection(set(s2))
word_overlap = lambda s1,s2: len(intersection(s1,s2))
lemmatize = lambda s: [token.lemma_ for token in parser(' '.join(s))]

In [43]:
def uni_prec(s1, s2, lemmatized=False): # s1,s2 assumed to be lists of words
    
    if lemmatized:
        s1, s2 = lemmatize(s1), lemmatize(s2)

    return div(word_overlap(s1,s2) * \
               sum(idf(w) for w in intersection(s1,s2)),
               len(s1))


In [38]:
s0 = msr_sents[0]
s1 = msr_sents[1] # known to be the paraphrase of q
s2 = msr_sents[2] # known to not be the paraphrase of q

In [49]:
%%time
print uni_prec(s0,s1)
print uni_prec(s0,s2)

52.7638330069
0.621903467176
CPU times: user 415 ms, sys: 1.24 ms, total: 417 ms
Wall time: 418 ms


In [40]:
def uni_rec(s1, s2, lemmatized=False):
    
    if lemmatized:
        s1, s2 = lemmatize(s1), lemmatize(s2)    
    
    return div(word_overlap(s1,s2) * \
               sum(idf(w) for w in intersection(s1,s2)),
               len(s2))

In [48]:
%%time
print uni_rec(s0,s1)
print uni_rec(s0,s2)

50.1256413566
0.695068580961
CPU times: user 418 ms, sys: 1.89 ms, total: 420 ms
Wall time: 421 ms


### B. BLEU Prec/Rec

**NB** (cf. Wan et al. 2006:133)

* "... Bleu metric uses the geometric average of unigram, bigram and trigram precision scores."
* "... by reversing [two sentences], ... a recall version of Bleu is obtained."

In [44]:
from nltk import bleu

In [52]:
def bleu_prec(s1, s2, lemmatized=False): # s1 as the 'hypothesis'
    
    if lemmatized:
        s1, s2 = lemmatize(s1), lemmatize(s2) 
    
    return bleu(s2,s1)

In [53]:
def bleu_rec(s1, s2, lemmatized=False): # s2 as the 'hypothesis'
    
    if lemmatized:
        s1, s2 = lemmatize(s1), lemmatize(s2) 
    
    return bleu(s1,s2)

In [50]:
%%time
print bleu_prec(s0,s1)
print bleu_prec(s0,s2)

0.630364741336
0.478973625444
CPU times: user 5.02 ms, sys: 1.96 ms, total: 6.98 ms
Wall time: 5.28 ms


In [51]:
%%time
print bleu_rec(s0,s1)
print bleu_rec(s0,s2)

0.622332977288
0.492479060505
CPU times: user 4.68 ms, sys: 2.22 ms, total: 6.9 ms
Wall time: 5.17 ms


### C. Dependency Prec/Rec

**Math**

* $Dep\_Prec(s_1,s_2) = \frac{|dep\_pair(s_1)|\cap|dep\_pair(s_2)|}{|dep\_pair(s_1)|}$ (cf. Wan et al. 2006:134)


* $Dep\_Rec(s_1,s_2) = \frac{|dep\_pair(s_1)|\cap|dep\_pair(s_2)|}{|dep\_pair(s_2)|}$ (cf. ibid.)

**NB**: $relation$ in the reference confuses *dependency pair* with *dependency relation*. $relation$ refers to "... a pair of words in a parent-child relationship within the dependency tree, referred to as head-modifier relationship. ... we ignore the label of the relationships which indicates the semantic role".

In [106]:
dep_pairs = lambda parsed_s: {(token.head.lemma_,token.lemma_) for token in parsed_s
                              if token.head.lemma_!=token.lemma_} # eliminte (v, ROOT, v) cases

In [107]:
def dep_prec(s1, s2, lemmatized=False):
    
    if lemmatized:
        s1, s2 = lemmatize(s1), lemmatize(s2)    
    parsed_s1, parsed_s2 = parser(' '.join(s1)), parser(' '.join(s2))
    
    dep_pairs_s1, dep_pairs_s2 = dep_pairs(parsed_s1), dep_pairs(parsed_s2)
    
    return div(len(dep_pairs_s1.intersection(dep_pairs_s2)),
               len(dep_pairs_s1))
        

In [108]:
def dep_rec(s1, s2, lemmatized=False):
    
    if lemmatized:
        s1, s2 = lemmatize(s1), lemmatize(s2)    
    parsed_s1, parsed_s2 = parser(' '.join(s1)), parser(' '.join(s2))
    
    dep_pairs_s1, dep_pairs_s2 = dep_pairs(parsed_s1), dep_pairs(parsed_s2)
    
    return div(len(dep_pairs_s1.intersection(dep_pairs_s2)),
               len(dep_pairs_s2))


In [109]:
%%time
print dep_prec(s0,s1)
print dep_prec(s0,s2)

0.588235294118
0.0
CPU times: user 7.35 ms, sys: 615 µs, total: 7.97 ms
Wall time: 7.1 ms


In [110]:
%%time
print dep_rec(s0,s1)
print dep_rec(s0,s2)

0.555555555556
0.0
CPU times: user 7.26 ms, sys: 401 µs, total: 7.66 ms
Wall time: 7.07 ms


### D. F1

**Math**

* $F1 = 2\cdot\frac{prec\cdot rec}{prec + rec}$ (cf. https://en.wikipedia.org/wiki/F1_score)

In [111]:
def f1_unigram(s1, s2, lemmatized=False):

    if lemmatized:
        s1, s2 = lemmatize(s1), lemmatize(s2)    
    prec, rec = uni_prec(s1,s2), uni_rec(s1,s2)
    
    return 2 * div(prec*rec,prec+rec)

In [112]:
def f1_bleu(s1, s2, lemmatized=False):

    if lemmatized:
        s1, s2 = lemmatize(s1), lemmatize(s2)    
    prec, rec = bleu_prec(s1,s2), bleu_rec(s1,s2)
    
    return 2 * div(prec*rec,prec+rec)

In [113]:
def f1_dep(s1, s2, lemmatized=False):

    if lemmatized:
        s1, s2 = lemmatize(s1), lemmatize(s2)    
    prec, rec = dep_prec(s1,s2), dep_rec(s1,s2)
    
    return 2 * div(prec*rec,prec+rec)

In [114]:
%%time
print f1_unigram(s0,s1)
print f1_unigram(s0,s2)

51.5021731165
0.666901083707
CPU times: user 864 ms, sys: 2.73 ms, total: 867 ms
Wall time: 869 ms


In [115]:
%%time
print f1_bleu(s0,s1)
print f1_bleu(s0,s2)

0.626323111188
0.485632464611
CPU times: user 8.74 ms, sys: 4.1 ms, total: 12.8 ms
Wall time: 10 ms


In [116]:
%%time
print f1_dep(s0,s1)
print f1_dep(s0,s2)

0.571428571429
0
CPU times: user 11.4 ms, sys: 542 µs, total: 11.9 ms
Wall time: 11.3 ms


### E. Tree Edit Distance

In [175]:
from zss import simple_distance, Node 
    # use zss.distance if dynamic tree modification is needed. 
    #  cf. zss api: pythonhosted.org/zss/.

In [190]:
get_root = lambda parsed_s: Node([token.lemma_ for token in parsed_s if token.dep_=='ROOT'][0])

In [191]:
def make_zss_tree(node, dep_pairs):
    
    for dep_pair in dep_pairs:
        if node.label==dep_pair[0]:
            kid = make_zss_tree(Node(dep_pair[1]), dep_pairs)
            node.addkid(kid)
    
    return node

In [192]:
def tree_edit_dist(s1, s2):
    
    parsed_s1, parsed_s2 = parser(' '.join(s1)), parser(' '.join(s2))
    root_s1, root_s2 = get_root(parsed_s1), get_root(parsed_s1)
    dep_pairs_s1, dep_pairs_s2 = dep_pairs(parsed_s1), dep_pairs(parsed_s2)
    
    tree_s1, tree_s2 = make_zss_tree(root_s1,dep_pairs_s1), \
                       make_zss_tree(root_s2,dep_pairs_s2)
    
    return simple_distance(tree_s1, tree_s2)
    

In [194]:
%%time
print tree_edit_dist(s0,s1)
print tree_edit_dist(s0,s2)

11
17
CPU times: user 11.2 ms, sys: 721 µs, total: 11.9 ms
Wall time: 10.8 ms


##### Step-by-Step Walkthrough of Tree Building

**a. Parse Sample Sents**

In [133]:
sample1 = u'i ate a big mac'
sample2 = u'i ate a small mac'
parsed_sample1 = parser(sample1)
parsed_sample2 = parser(sample2)
print dep_pairs(parsed_sample1)
print dep_pairs(parsed_sample2)

set([(u'mac', u'a'), (u'eat', u'mac'), (u'mac', u'big'), (u'eat', u'i')])
set([(u'mac', u'a'), (u'eat', u'mac'), (u'eat', u'i'), (u'mac', u'small')])


In [134]:
for token in parsed_sample1:
    print token.head.lemma_,token.dep_,token.lemma_

eat nsubj i
eat ROOT eat
mac det a
mac amod big
eat dobj mac


**b. Make Tree & Check for Samples**

In [159]:
test1 = make_zss_tree(Node('eat'),dep_pairs(parsed_sample1))

In [160]:
print test1

2:eat
2:mac
0:a
0:big
0:i


In [161]:
print test1.get_children(test1)
mac, i = test.get_children(test1)

[<zss.simple_tree.Node object at 0x11a661d50 mac>, <zss.simple_tree.Node object at 0x11a661c90 i>]


In [162]:
print mac.get_children(mac)
a, big = mac.get_children(mac)

[<zss.simple_tree.Node object at 0x11a661c10 a>, <zss.simple_tree.Node object at 0x11a661c50 big>]


In [163]:
print i.get_children(i)

[]


In [164]:
print a.get_children(a)
print big.get_children(big)

[]
[]


In [165]:
test2 = make_zss_tree(Node('eat'),dep_pairs(parsed_sample2))

In [172]:
print test2

2:eat
2:mac
0:a
0:small
0:i


In [166]:
print test2.get_children(test2)
mac, i = test.get_children(test2)

[<zss.simple_tree.Node object at 0x11a661d90 mac>, <zss.simple_tree.Node object at 0x11a661610 i>]


In [168]:
print mac.get_children(mac)
a, small = mac.get_children(mac)

[<zss.simple_tree.Node object at 0x11a661090 a>, <zss.simple_tree.Node object at 0x11a6617d0 small>]


In [170]:
print i.get_children(i)

[]


In [169]:
print a.get_children(a)
print small.get_children(small)

[]
[]


**c. Compute Edit Distance**

In [171]:
simple_distance(test1,test2)

1

### F. Sentence Lengths

* "... the difference in length of two sentences ... measured in words by subtracting one length from the other." (cf. Wan et al. 2006:134)
* "... this difference could be a negative or positive integer ... an absolute variant was used." (cf. ibid.)

In [196]:
def sent_len_diffs(s1, s2):
    
    diff = len(s1)-len(s2)
    
    return diff, abs(diff)

In [197]:
print sent_len_diffs(s0,s1)
print sent_len_diffs(s0,s2)

(-1, 1)
(2, 2)


## IIb: Featurization Function