# Sentence Similarity Measures II: KB Syn-Sem Family

## 0. Contents

* I. Corpora (SpaCy Preprocessing for Lemmatization)
    * MSR Paraphrase Corpus (for evaluation)
    * Brown Corpus (for computing info content of words)
    * WordNet (for computing word similarity)
* II. Word Similarity
* III. Sentence Similarity
* IV. Word-Order Similarity
* V. Overall Sentence Similarity (Linear Combination of Sentence & Word-Order Similarities)

## I. Corpora

* MSR Paraphrase Corpus
* Brown Corpus
* NLTK WordNet

### A. MSR Preprocessing

##### Load

In [44]:
import pandas as pd

In [45]:
path = "/Users/jacobsw/Desktop/WORK/OJO_CODE/SENTENCE_SIMILARITIES/CORPORA/paraphrase/msr_paraphrase_data.txt"

In [46]:
df = pd.read_csv(path, delimiter='\t')
df.head()

| | Sentence ID | String | Author | URL | Agency | Date | Web Date |
|---|---|---|---|---|---|---|---|
| 0 | 702876 | Amrozi accused his brother, whom he called "th... | Darren Goodsir | www.theage.com.au | * | June 5 2003 | 2003/06/04 |
| 1 | 702977 | Referring to him as only "the witness", Amrozi... | Darren Goodsir | www.smh.com.au | Sydney Morning Herald | June 5 2003 | 2003/06/04 |
| 2 | 2108705 | Yucaipa owned Dominick's before selling the ch... | MICHAEL GIBBS | www.nwherald.com | * | * | 2003/08/23 |
| 3 | 2108831 | Yucaipa bought Dominick's in 1995 for $693 mil... | ALEX VEIGA | www.miami.com | * | * | 2003/08/23 |
| 4 | 1330381 | They had published an advertisement on the Int... | Philip Pangalos | www.alertnet.org | * | * | 2003/06/25 |


In [47]:
len(df['String'])

10594

In [48]:
data = list(df['String'])
for i in xrange(5):
    print data[i]
    print

Amrozi accused his brother, whom he called "the witness", of deliberately distorting his evidence.

Referring to him as only "the witness", Amrozi accused his brother of deliberately distorting his evidence.

Yucaipa owned Dominick's before selling the chain to Safeway in 1998 for $2.5 billion.

Yucaipa bought Dominick's in 1995 for $693 million and sold it to Safeway for $1.8 billion in 1998.

They had published an advertisement on the Internet on June 10, offering the cargo for sale, he added.



##### To Lemmas

In [49]:
from spacy.en import English

In [50]:
parser = English()

In [166]:
def parse_msr():
    
    parsed_sents = [parser(unicode(sent.decode('utf8','ignore'))) for sent in data]
    lemma_sents = [[token.lemma_ for token in parsed_sent] 
                   for parsed_sent in parsed_sents]
    lemma_words = [lemma for lemma_sent in lemma_sents for lemma in lemma_sent]
    
    return lemma_sents, lemma_words

In [167]:
%%time
msr_sents, msr_words = parse_msr()

CPU times: user 15.7 s, sys: 60 ms, total: 15.7 s
Wall time: 15.7 s


In [168]:
print msr_sents[0]
print
print msr_words[:10]

[u'amrozi', u'accuse', u'his', u'brother', u',', u'whom', u'he', u'call', u'"', u'the', u'witness', u'"', u',', u'of', u'deliberately', u'distort', u'his', u'evidence', u'.']

[u'amrozi', u'accuse', u'his', u'brother', u',', u'whom', u'he', u'call', u'"', u'the']


### B. Brown Preprocessing

In [169]:
from nltk.corpus import brown

In [170]:
def parse_brown():
    
    sents = brown.sents()
    parsed_sents = [parser(' '.join(sent)) for sent in sents]
    lemma_words = [token.lemma_ for parsed_sent in parsed_sents for token in parsed_sent]
    
    return lemma_words

In [171]:
%%time
brown_words = parse_brown()

CPU times: user 1min 36s, sys: 715 ms, total: 1min 37s
Wall time: 1min 38s


In [179]:
N = len(brown_words)
N

1188973

### C. WordNet

In [173]:
from nltk.corpus import wordnet as wn

## II. Word Similarity with Knowledge Base

**Math**

* **Li et al. (2006)'s WordNet Word Similarity**
    * Equation: $SIM(w_1,w_2) = e^{-\alpha l}\cdot \frac{e^{\beta h}-e^{-\beta h}}{e^{\beta h}+e^{-\beta h}}$ (cf. ibid.:14,(5)).
    * Breakdown: The similarity between $w_1$ and $w_2$ is the product of the following functions:
        * Path Length Function: $f(l) = e^{-\alpha l}$
        * Subsumer Depth Function: $g(h) = \frac{e^{\beta h}-e^{-\beta h}}{e^{\beta h}+e^{-\beta h}}$
    * Measures:
        * Path Length: (cf. ibid.:13)
            * $0$ if $w_1$ and $w_2$ are in the same synset.
            * $1$ if $w_1$ and $w_2$ are not in the same synset, but a synset of $w_1$ and a synset of $w_2$ share one or more words.
            * The *shortest path length* between their synsets in WordNet if neither of the above is true.
        * Subsumer Depth: (cf. ibid.:14)
            * "Words at upper layers of hierarchical semantic nets have more general concepts and less semantic similarity between words than words at lower layers. Therefore $g(h)$ should increase monotonically with respect to the subsumer depth".

In [160]:
import numpy as np

In [161]:
lemmas = lambda synset: frozenset(str(lemma.name()) for lemma in synset.lemmas()
                         if '_' not in str(lemma.name())) # there are lemmas like 'domestic_dog'.
div = lambda x,y: x/y if y!=0 else 0

In [231]:
def path_len(w1, w2):
    
    w1synsets, w2synsets = wn.synsets(w1), wn.synsets(w2)
    w1syns = {lemmas(syn) for syn in w1synsets}
    w2syns = {lemmas(syn) for syn in w2synsets}
    
    for syn in w1syns.union(w2syns):
        if w1 in syn and w2 in syn:
            return 0
    for w1syn in w1syns:
        for w2syn in w2syns:
            if w1syn.intersection(w2syn):
                return 1
    pls = []
    for w1syn in w1synsets:
        for w2syn in w2synsets:
            pl = w1syn.shortest_path_distance(w2syn)
            if pl is not None: pls.append(pl)

    return 50 if len(pls)==0 else min(pls) # to penalize non-related words
          

In [232]:
%%time
path_len('dog','cat')

CPU times: user 8.86 ms, sys: 2.66 ms, total: 11.5 ms
Wall time: 9.43 ms


4

In [238]:
def subsumer_depth(w1, w2):
    
    w1synsets, w2synsets = wn.synsets(w1), wn.synsets(w2)
    subsumers = []
    for w1syn in w1synsets:
        for w2syn in w2synsets:
            subsumers += w1syn.common_hypernyms(w2syn)
    subsumers = list(set(subsumers))
    
    depths = [subsumer.min_depth() for subsumer in subsumers] 
    
    return 0 if len(depths)==0 else max(depths) # penalizes no-subsumer case.


In [237]:
%%time
subsumer_depth('dog','cat')

CPU times: user 889 µs, sys: 355 µs, total: 1.24 ms
Wall time: 996 µs


11

In [162]:
def word_sim(w1, w2, alpha=.2, beta=.45):
    
    l, h = path_len(w1,w2), subsumer_depth(w1,w2)
    
    return np.exp(-alpha*l) * \
           div(np.exp(beta*h)-np.exp(-beta*h), \
               np.exp(beta*h)+np.exp(-beta*h))
    

In [164]:
print word_sim('dog','cat')
print word_sim('dog','canine')

0.449283876504
0.818697350358
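
As a sanity check, the first value above can be reproduced by plugging the path length and subsumer depth reported earlier for `dog`/`cat` ($l=4$, $h=11$) directly into the formula. The sketch below is written for Python 3 and uses only the standard library; `li_word_sim` is an illustrative name, not part of the notebook, and it relies on the fact that $g(h)$ is exactly $\tanh(\beta h)$.

```python
import math

def li_word_sim(l, h, alpha=0.2, beta=0.45):
    """Li et al. (2006) word similarity from a path length l and a subsumer depth h."""
    f = math.exp(-alpha * l)   # path length function f(l)
    g = math.tanh(beta * h)    # subsumer depth function g(h), i.e. tanh(beta * h)
    return f * g

# l = 4 and h = 11 are the values computed above for 'dog' and 'cat'.
print(round(li_word_sim(4, 11), 6))  # 0.449284
```

Writing $g(h)$ as a hyperbolic tangent also makes its behaviour easy to read off: it is $0$ at $h=0$ and saturates towards $1$ for deep subsumers, so for specific concepts the similarity is driven almost entirely by the path length term.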


## III. Sentence Similarity

**Math**

* **Sentence Vector $\check{s}$**:
    * Build a vector template $\check{s}$ whose cells correspond to the set of distinct words in the two sentences $s_1$, $s_2$, i.e. $\{w \mid w\in s_1\cup s_2\}$.
    * For $s_1$ and $s_2$, build their vectors $\check{s}_1$ and $\check{s}_2$ as follows: for each $w_i$ in $\check{s}$,
        * If $w_i$ appears in the sentence, set $\check{s}_{1/2,i} = 1$.
        * Otherwise, compute $w_i$'s similarity to every word in $s_{1/2}$, and set $\check{s}_{1/2,i}$ to the highest similarity value found.
    * Each cell $\check{s}_{1/2,i}$ is weighted by the *Information Content* of the corresponding word $w_i$, computed as $I(w) = -\frac{\log p(w)}{\log(N+1)} = 1 - \frac{\log(n+1)}{\log(N+1)}$, where $n$ is the frequency of $w$ in a corpus (Brown, in this case), $N$ is the size of the corpus, and $p(w) = \frac{n+1}{N+1}$. The weighting: $\check{s}_i = \check{s}_i\cdot I(w_i)\cdot I(\tilde{w}_i)$, where $\tilde{w}_i$ is the word in the sentence that is associated with $w_i$ (i.e. $w_i$ itself when it is found in the sentence, and its most similar word otherwise).


* **Sentence Similarity**:
    * Equation: $SIM(s_1,s_2) = \frac{\check{s}_1\cdot\check{s}_2}{||\check{s}_1||\cdot||\check{s}_2||}$.
    * I.e. Cosine Similarity
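
The construction above can be sketched on a toy example. The code below is a minimal Python 3 illustration, not the notebook's implementation: `TOY_SIM` is a made-up similarity table standing in for the WordNet-based word similarity, the information content values in `info` are invented, and every cell is weighted as in the formula.

```python
import math

# Hypothetical word similarities (symmetric); unlisted pairs count as 0.
TOY_SIM = {("dog", "puppy"): 0.8, ("runs", "walks"): 0.6}

def toy_sim(w1, w2):
    if w1 == w2:
        return 1.0
    return TOY_SIM.get((w1, w2), TOY_SIM.get((w2, w1), 0.0))

def semantic_vector(template, sent, info):
    """One cell per template word: best similarity to `sent`, weighted by I(w) * I(w~)."""
    vec = []
    for w in template:
        best = max(sent, key=lambda w_j: toy_sim(w, w_j))  # w itself if w is in sent
        vec.append(toy_sim(w, best) * info[w] * info[best])
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

s1 = ["the", "dog", "runs"]
s2 = ["the", "puppy", "walks"]
template = sorted(set(s1) | set(s2))           # joint word set, in a fixed order
info = {"the": 0.1, "dog": 0.7, "puppy": 0.8,  # invented information content values
        "runs": 0.6, "walks": 0.6}

v1 = semantic_vector(template, s1, info)
v2 = semantic_vector(template, s2, info)
print(cosine(v1, v2))
```

Because a frequent word such as `the` has low information content, its cell contributes very little to the cosine, which is the point of the weighting.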

In [178]:
log = lambda x: np.log(x) if x>0 else np.log(1e-20)

In [268]:
I = lambda w: 1 - div(log(brown_words.count(w)+1),log(N+1))
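
`brown_words.count(w)` rescans the whole corpus on every call, which can become costly when `I` is evaluated for many words. A possible alternative, sketched here for Python 3 under the assumption that the lemma list fits in memory, is to count every word once with `collections.Counter`. The helper name `make_info_content` and the tiny corpus are illustrative only.

```python
import math
from collections import Counter

def make_info_content(corpus_words):
    """Return I(w) = 1 - log(n + 1) / log(N + 1), with word counts n precomputed once."""
    counts = Counter(corpus_words)                 # word -> frequency n
    log_total = math.log(len(corpus_words) + 1)    # log(N + 1); assumes a non-empty corpus

    def info_content(w):
        # Counter returns 0 for unseen words, so they get n = 0 and I = 1.
        return 1.0 - math.log(counts[w] + 1) / log_total

    return info_content

# Toy corpus standing in for the lemmatized Brown word list.
I = make_info_content(["the", "dog", "the", "cat", "the"])
print(I("the"), I("dog"), I("unicorn"))
```

Each lookup is then a dictionary access rather than a pass over the corpus.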

In [292]:
def vec(s1, s2): # assuming s1,s2 are lists of words.
    
    s_check = list(set(s1).union(set(s2)))
    l_check = len(s_check)
    s1_check, s2_check = np.zeros(l_check), np.zeros(l_check)
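    for i,w in enumerate(s_check):
        # w occurs in s1: the cell is set to 1 as is (no info-content weighting in this branch).
        if w in s1: s1_check[i] = 1
        else: 
            # otherwise: similarity to w's most similar word in s1, weighted by both words' info content.
            idx,most_sim = max(enumerate(s1), key=lambda (j,w_j):word_sim(w,w_j)) # idx: that of w's most sim.
            s1_check[i] = word_sim(w,most_sim) * I(w) * I(s1[idx]) # weight by info content
        # same procedure for s2.
        if w in s2: s2_check[i] = 1
        else: 
            idx,most_sim = max(enumerate(s2), key=lambda (j,w_j):word_sim(w,w_j)) 
            s2_check[i] = word_sim(w,most_sim) * I(w) * I(s2[idx])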
    
    return s1_check, s2_check


In [307]:
def sent_sim(s1, s2):
    
    s1_vec, s2_vec = vec(s1, s2)
    
    return div(np.dot(s1_vec,s2_vec),
               np.sqrt(np.dot(s1_vec,s1_vec)) * \
               np.sqrt(np.dot(s2_vec,s2_vec)))


In [319]:
q = msr_sents[0]
r1 = msr_sents[1] # known to be the paraphrase pairmate of q.
r2 = msr_sents[2] # known not to be the paraphrase pairmate of q.

In [320]:
%%time
print sent_sim(q, r1)
print sent_sim(q, r2)

0.797807014451
0.185647424791
CPU times: user 2.54 s, sys: 18 ms, total: 2.56 s
Wall time: 2.56 s


## IV. Word-Order Similarity

**Math**

* **Order Similarity**:
    * Equation: $SIM(s_1,s_2) = 1 - \frac{||r_1 - r_2||}{||r_1 + r_2||}$ (cf. Li et al. (2006):18,(8)).
    * Breakdown: Word order vectors $r_1$ and $r_2$ are computed as follows:
        * Build vector template $\check{s}$ as in section III.
        * For $s_1$ and $s_2$, build word order vectors. For each $w_i$ in $\check{s}$,
            * If $w_i$ is found in $s_{1/2}$, set $r_{1/2,i}$ to the index of $w_i$ in $s_{1/2}$.
            * Otherwise, set $r_{1/2,i}$ to the index of $w_i$'s most similar word in $s_{1/2}$.
    * Idea: "... normalized difference of word order" (cf. ibid.)
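
A small worked example may help. The sketch below (Python 3, standard library only) uses two sentences that contain exactly the same words, so every template word is found in both sentences and no word similarity lookup is needed. Each entry is the word's 0-based position in its sentence, as in the `order_vec` code below. The function names are illustrative.

```python
import math

def order_vector(template, sent):
    """Position of each template word in `sent` (every word is assumed to occur in `sent`)."""
    return [sent.index(w) for w in template]

def order_similarity(r1, r2):
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2)))   # ||r1 - r2||
    total = math.sqrt(sum((a + b) ** 2 for a, b in zip(r1, r2)))  # ||r1 + r2||
    return 1.0 - diff / total if total else 1.0

s1 = "a quick brown dog jumps over the lazy fox".split()
s2 = "a quick brown fox jumps over the lazy dog".split()
template = sorted(set(s1) | set(s2))  # joint word set, in a fixed order

r1, r2 = order_vector(template, s1), order_vector(template, s2)
print(order_similarity(r1, r1))  # identical word order
print(order_similarity(r1, r2))  # 'dog' and 'fox' swapped
```

The two sentences have identical bags of words, so only the order term can tell them apart: the first call gives exactly 1, while swapping `dog` and `fox` lowers the score.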

In [314]:
def order_vec(s1, s2):
    
    s_check = list(set(s1).union(set(s2)))
    l_check = len(s_check)
    r1, r2 = np.zeros(l_check), np.zeros(l_check)    
    for i,w in enumerate(s_check):
        if w in s1:
            r1[i] = s1.index(w)
        else:
            most_sim = max(s1, key=lambda w_j:word_sim(w,w_j)) 
            r1[i] = s1.index(most_sim)
        if w in s2:
            r2[i] = s2.index(w)
        else:
            most_sim = max(s2, key=lambda w_j:word_sim(w,w_j)) 
            r2[i] = s2.index(most_sim)   
            
    return r1, r2


In [327]:
def order_sim(s1, s2):
    
    r1, r2 = order_vec(s1, s2)
    
    diff = r1 - r2
    norm = r1 + r2
    
    return 1 - div(np.sqrt(np.dot(diff,diff)),np.sqrt(np.dot(norm,norm)))


In [330]:
%%time
print order_sim(q,r1)
print order_sim(q,r2)

0.702183520613
0.452824344835
CPU times: user 542 ms, sys: 13.5 ms, total: 556 ms
Wall time: 549 ms


## V. Overall Sentence Similarity

**Math**

* $SIM(s_1,s_2) = \delta\cdot SIM_{sent}(s_1,s_2) + (1-\delta)\cdot SIM_{order}(s_1,s_2)$.
* $\delta \in (0.5,1]$, considering word order's "... subordinate role in semantic processing".  

In [331]:
def overall_sent_sim(s1, s2, delta=.85): # delta is a value in (.5,1]. (cf. Li et al. (2006):20,24)
    
    return delta*sent_sim(s1,s2) + (1-delta)*order_sim(s1,s2)

In [332]:
%%time
print overall_sent_sim(q,r1)
print overall_sent_sim(q,r2)

0.783463490376
0.225723962797
CPU times: user 3.3 s, sys: 28.9 ms, total: 3.33 s
Wall time: 3.33 s
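
The printed values can be checked by hand from the sentence and word-order similarities reported in sections III and IV. The snippet below (Python 3) redoes the linear combination with those numbers; `combine` is an illustrative name.

```python
def combine(sent_sim_value, order_sim_value, delta=0.85):
    """Weighted sum of semantic and word-order similarity, with delta in (0.5, 1]."""
    return delta * sent_sim_value + (1 - delta) * order_sim_value

# (sentence similarity, order similarity) as printed in sections III and IV.
paraphrase = combine(0.797807014451, 0.702183520613)
non_paraphrase = combine(0.185647424791, 0.452824344835)
print(round(paraphrase, 6), round(non_paraphrase, 6))  # 0.783463 0.225724
```

With $\delta = 0.85$ the word-order term moves the score only slightly: it pulls the paraphrase pair down a little and lifts the non-paraphrase pair a little, and the gap between the two remains large.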
