# Lin (1998)

* **Statistic**: 
    * A Variant of Mutual Information
    * Word Similarity
* **Corpus**:
    * Brown (NLTK)
* **Parsing**:
    * Type: Dependency
    * Library: spaCy
* **Categorization**:
    * All POS

### A. Math

* **Information of a Dependency Triple**

    * $ \begin{aligned} I(w,r,w') &= -\log\left(P_{MLE}(r)\,P_{MLE}(w|r)\,P_{MLE}(w'|r)\right) - \left(-\log P_{MLE}(w,r,w')\right) \\
    &= \log\frac{||w,r,w'||\times||*,r,*||}{||w,r,*||\times||*,r,w'||} \end{aligned}$, where $*$ is a wildcard: the count is summed over all values in that position. (cf. Lin 1998:769)
    

* **Word Similarity**

    * $\mathrm{sim}(w_1,w_2) = \frac{\sum_{(r,w)\in T(w_1)\cap T(w_2)}(I(w_1,r,w) + I(w_2,r,w))}{\sum_{(r,w)\in T(w_1)}I(w_1,r,w) + \sum_{(r,w)\in T(w_2)}I(w_2,r,w)}$, where $T(w)$ is the set of pairs $(r,w')$ for which $I(w,r,w')$ is positive.
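
To make the two formulas concrete, here is a minimal sketch that evaluates them on a handful of invented triples. The triples, the relation labels (`dobj`, `nsubj`) and the helper names (`info`, `features`, `lin_sim`) are assumptions made for illustration; none of them come from the Brown corpus or from the cells below.

```python
from __future__ import division, print_function
from collections import Counter
from math import log

# Invented (w, r, w') triples, for illustration only.
toy_triples = (
    [("drink", "dobj", "water")] * 2 + [("drink", "dobj", "tea")]
    + [("sip", "dobj", "tea")] * 2 + [("sip", "dobj", "water")]
    + [("eat", "dobj", "bread")] * 3 + [("eat", "dobj", "tea")]
    + [("drink", "nsubj", "cat"), ("eat", "nsubj", "dog")]
)

n_wrw = Counter(toy_triples)                          # ||w, r, w'||
n_r = Counter(r for _, r, _ in toy_triples)           # ||*, r, *||
n_wr = Counter((w, r) for w, r, _ in toy_triples)     # ||w, r, *||
n_rw = Counter((r, v) for _, r, v in toy_triples)     # ||*, r, w'||

def info(w, r, v):
    """I(w, r, w') from raw counts; 0.0 for unseen triples."""
    if n_wrw[(w, r, v)] == 0:
        return 0.0
    return log(n_wrw[(w, r, v)] * n_r[r] / (n_wr[(w, r)] * n_rw[(r, v)]))

def features(w):
    """T(w): the (r, w') pairs whose information with w is positive."""
    return {(r, v) for (u, r, v) in n_wrw if u == w and info(u, r, v) > 0}

def lin_sim(w1, w2):
    """Shared information divided by total information of both words."""
    t1, t2 = features(w1), features(w2)
    shared = sum(info(w1, r, v) + info(w2, r, v) for r, v in t1 & t2)
    total = sum(info(w1, r, v) for r, v in t1) + sum(info(w2, r, v) for r, v in t2)
    return shared / total if total else 0.0

# ||drink,dobj,water|| = 2, ||*,dobj,*|| = 10, ||drink,dobj,*|| = 3, ||*,dobj,water|| = 3
print(info("drink", "dobj", "water"))   # log(2 * 10 / (3 * 3))
print(lin_sim("drink", "sip"))          # share one informative feature
print(lin_sim("drink", "eat"))          # share no informative feature
```

In this toy data `drink` and `sip` share only the feature `(dobj, water)`, so their similarity falls strictly between 0 and 1, while `drink` and `eat` have no informative feature in common and score 0.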

### B. Extract Dependency Triples

In [1]:
from nltk.corpus import brown
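# legacy spaCy import (pre-2.0); newer releases moved this module to spacy.lang.en
# and load a parsing pipeline with spacy.load(...)
from spacy.en import English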

In [2]:
from nltk.stem import PorterStemmer
porter = PorterStemmer()

In [4]:
def dependency_triples():
    """Return stemmed, lowercased (word, relation, head) triples for the Brown corpus."""

    # extract & parse sentences
    sents = [' '.join(sent) for sent in brown.sents()]
    parser = English()
    parsed_corpus = [parser(sent) for sent in sents]

    # triple extraction facilities
    get_triple = lambda s: [(token.orth_,token.dep_,token.head.orth_) for token in s]
    porter = PorterStemmer()
    # takes the triple as one argument: tuple-unpacking lambdas are Python 2 only
    to_stemmed = lambda t: (porter.stem(t[0]).lower(),t[1],porter.stem(t[2]).lower())

    # extract triples
    dep_triples = []
    for sent in parsed_corpus:
        dep_triples += get_triple(sent)

    # stem triples (a list, so the result can be iterated more than once)
    stemmed_dep_triples = [to_stemmed(t) for t in dep_triples]

    return stemmed_dep_triples
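
The shape of the extracted triples can be seen without running the parser. The sketch below hard-codes a hypothetical parse of one short sentence (the relation labels and head indices are written by hand, not produced by spaCy) and applies the same (word, relation, head) pattern as `get_triple`, with lowercasing standing in for stemming. The names `toy_parse` and `toy_triples` are illustrative.

```python
from __future__ import print_function

# Hand-written parse of "She wrote a brief". Each entry is
# (word, relation label, index of the head word). Hypothetical values.
toy_parse = [
    ("She",   "nsubj", 1),
    ("wrote", "ROOT",  1),   # the root points at itself, as spaCy's token.head does
    ("a",     "det",   3),
    ("brief", "dobj",  1),
]

def toy_triples(parse):
    """Return (word, relation, head word) triples, lowercased."""
    return [(word.lower(), rel, parse[head][0].lower())
            for word, rel, head in parse]

for triple in toy_triples(toy_parse):
    print(triple)
```

Each token contributes exactly one triple, with the dependent first and its head last, so the features collected for a word are the (relation, head) pairs it attaches to. The root token yields a triple that pairs the word with itself.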


In [6]:
%%time
triples = dependency_triples()

CPU times: user 2min 2s, sys: 2.73 s, total: 2min 5s
Wall time: 2min 6s


### C. Compute Argument Similarities

In [13]:
from __future__ import division  # __future__ imports must come first
import numpy as np
from collections import Counter, defaultdict

In [14]:
# COMPUTATIONAL FACILITIES
# log returns 0 for x == 0 (instead of -inf); div returns 0 for a zero denominator
log = lambda x: np.log(x) if x!=0 else 0
div = lambda x,y: x/y if y!=0 else 0.

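In [15]:
# COMPUTATIONAL LOOKUPS
c_wrw = Counter(triples)                        # ||w,r,w'||
c_0r0 = Counter(r for _,r,_ in triples)         # ||*,r,*||
c_wr0 = Counter((w,r) for w,r,_ in triples)     # ||w,r,*||
c_0rw = Counter((r,w) for _,r,w in triples)     # ||*,r,w'||
# index the (r,w_prime) pairs by first word once, instead of rescanning all triples per query
T_index = defaultdict(set)
for w,r,w_prime in triples:
    T_index[w].add((r,w_prime))
Tw = lambda w_i: T_index.get(w_i, set())
    # r,w_prime pairs, where w_i is the first in triple.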

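In [17]:
def I(w, r, w_prime):
    """Information of the triple (w, r, w_prime), clipped at zero."""
    i =  log( div( c_wrw[(w,r,w_prime)] * c_0r0[r] , 
                   c_wr0[(w,r)] * c_0rw[(r,w_prime)]) )
    return i if i>=0 else 0 # Lin (1998) only counts triples with positive information

def sim(w1,w2):
    """Similarity of two words, computed on their stems."""
    w1,w2 = porter.stem(w1),porter.stem(w2)
    Tw1, Tw2 = Tw(w1), Tw(w2)
    # Tw holds every observed pair, so a shared pair counts here even when
    # only one of the two words has positive information for it
    Tw1w2 = Tw1.intersection(Tw2)
    num = sum(I(w1,r,w)+I(w2,r,w) for r,w in Tw1w2)
    denom = sum(I(w1,r,w) for r,w in Tw1) + \
            sum(I(w2,r,w) for r,w in Tw2)
    return div(num,denom) # 0 instead of ZeroDivisionError for words with no informative pairs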

In [24]:
# EXAMPLES (cf. Lin 1998:770)
word = 'brief'
similars = {'n': ['petition','affidavit','motion'],
            'v': ['tell','urge','elect'],
            'adj': ['lengthy','short','recent']}
dissimilars = {'n': ['chicken','water','flower'],
               'v': ['kill','drink','eat'],
               'adj': ['red','evil','big'],
               'other': ['the','that','what']}
# single-argument print(...) calls behave the same in Python 2 and 3
print('Similars: ')
for cls in similars:
    print(cls)
    for w in similars[cls]:
        print('    %s-%s: %.6f' % (word,w,sim(word,w)))
print('')
print('Dissimilars: ')
for cls in dissimilars:
    print(cls)
    for w in dissimilars[cls]:
        print('    %s-%s: %.6f' % (word,w,sim(word,w)))

Similars: 
v
    brief-tell: 0.005852
    brief-urge: 0.006521
    brief-elect: 0.004259
adj
    brief-lengthy: 0.052643
    brief-short: 0.055053
    brief-recent: 0.028293
n
    brief-petition: 0.057823
    brief-affidavit: 0.015976
    brief-motion: 0.029800

Dissimilars: 
v
    brief-kill: 0.000458
    brief-drink: 0.003995
    brief-eat: 0.000979
adj
    brief-red: 0.004278
    brief-evil: 0.006365
    brief-big: 0.003499
other
    brief-the: 0.000451
    brief-that: 0.000714
    brief-what: 0.003457
n
    brief-chicken: 0.001408
    brief-water: 0.002747
    brief-flower: 0.004911
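
The printed scores are easier to compare once they are sorted. The sketch below shows one way to rank candidate words by their similarity to a target. `rank_by_similarity` is a hypothetical helper, not part of the notebook, and `stub_sim` replaces the real `sim` with a lookup of the noun scores printed above so that the block runs without the corpus.

```python
from __future__ import print_function

# Noun scores copied from the printed output above; a stand-in for sim().
printed_scores = {"petition": 0.057823, "affidavit": 0.015976, "motion": 0.029800,
                  "chicken": 0.001408, "water": 0.002747, "flower": 0.004911}

def rank_by_similarity(word, candidates, sim_fn):
    """Return (candidate, score) pairs sorted from most to least similar."""
    scored = [(c, sim_fn(word, c)) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Assumption: the lookup ignores `word` because every score is relative to 'brief'.
stub_sim = lambda word, other: printed_scores[other]

for candidate, score in rank_by_similarity("brief", printed_scores, stub_sim):
    print("%-10s %.6f" % (candidate, score))
```

With these scores, the three nouns listed under `similars` all rank above the three listed under `dissimilars`. Passing the notebook's own `sim` as `sim_fn` would rank any candidate list the same way.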
