# Erk (2007)

* **Statistic**: Selectional Preference Strength
* **Corpus**: Brown
* **Parsing**: 
    * Type: Dependency
    * Library: Spacy
* **Categorization**:
    * Noun, Verb
    * Noun Subject, Noun Direct Object, Verb

**NB**: 
* This method seems to work better with large corpora; on small ones it is less suitable than Resnik (1996).
* For convenience, only one corpus is used here. In practice, it is recommended that a *primary corpus* and a *generalization corpus* be used.

### A. Get Subj/Obj Lists for Verbs 

In [389]:
from spacy.en import English
from nltk.corpus import brown
from collections import defaultdict

In [390]:
def extract_nv_info():
    """
        returns an nv_counts dict: {'nsubj': {(n,v): count, ...}, 'dobj': {(n,v): count, ...}}
        returns a v-arg dict: {v: {'nsubj': [n, ...], 'dobj': [n, ...]}, ...}
    """
    
    # build parsing facilities
    parser = English()
    nv_counts = {'nsubj':defaultdict(int),'dobj':defaultdict(int)}
    dic = defaultdict(lambda : defaultdict(list))
    def dictionarize(parsed):
        for token in parsed:
            if token.dep_ in ['nsubj','dobj'] \
                and w2t[token.head.orth_]=='VERB' \
                and w2t[token.orth_]=='NOUN':
                nv_counts[token.dep_][(token.lemma_,token.head.lemma_)] += 1
                if token.lemma_ not in dic[token.head.lemma_][token.dep_]:
                    dic[token.head.lemma_][token.dep_].append(token.lemma_)
                
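    # extract info from corpus
    # NB: w2t keeps a single tag per word form; a form that carries several tags
    # in Brown ends up with whichever one the dict comprehension sees last.
    tagged_words = list(set(brown.tagged_words(tagset='universal')))
    w2t = defaultdict(lambda : 'UNK', {w:t for w,t in tagged_words})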
    sents = [' '.join(sent) for sent in brown.sents()]
    
    # parse corpus
    parsed_sents = [parser(sent) for sent in sents]
    for parsed in parsed_sents:
        dictionarize(parsed)
    
    return nv_counts, dic

In [391]:
%%time
nv_counts, dic = extract_nv_info()

CPU times: user 1min 46s, sys: 1.99 s, total: 1min 48s
Wall time: 1min 48s

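To make the two return values concrete, here is a minimal sketch that fills the same two structures from a handful of hand-written (noun, dependency, verb) triples instead of parser output. The triples and the names `collect`, `toy_counts` and `toy_dic` are made up for illustration; only the shapes of the two results follow `extract_nv_info` above.

```python
from collections import defaultdict

# Hypothetical (noun lemma, dependency label, verb lemma) triples; in the
# notebook these come from spaCy parses of the Brown corpus.
triples = [
    ('dog', 'nsubj', 'eat'),
    ('dog', 'nsubj', 'eat'),
    ('child', 'nsubj', 'eat'),
    ('bread', 'dobj', 'eat'),
    ('bread', 'dobj', 'sell'),
    ('boat', 'dobj', 'sell'),
]

def collect(triples):
    """Build count and verb-argument structures from ready-made triples."""
    counts = {'nsubj': defaultdict(int), 'dobj': defaultdict(int)}
    args = defaultdict(lambda: defaultdict(list))
    for noun, dep, verb in triples:
        counts[dep][(noun, verb)] += 1          # how often noun fills dep of verb
        if noun not in args[verb][dep]:
            args[verb][dep].append(noun)        # each seen argument listed once
    return counts, args

toy_counts, toy_dic = collect(triples)
print(dict(toy_counts['nsubj']))
print(toy_dic['sell']['dobj'])
```

The count dict is keyed by (noun, verb) pairs, while the second dict goes from a verb to the distinct nouns seen in each slot, which is what section C iterates over.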

### B. Similarity Measure by Paradigmatic Co-Subj/Obj

**NB**: This *similarity* is not the typical *distributional similarity*, under which two words are similar if they appear in similar contexts in general (e.g. within some word window). Instead, two words count as similar here simply if they often appear as the subject/object of the same verbs. *bread* and *boat* may come out as similar because both appear as the object of the verb *sell*, even though they are not similar in any intuitive sense.

**Math**

* **Association**: $PPMI(a,b) = \begin{cases} PMI(a,b) \quad &\text{if } PMI(a,b) \geq 0 \\ 0 \quad &\text{otherwise} \end{cases}$, where $PMI(a,b) = \log\frac{P(a,b)}{P(a)\cdot P(b)}$


* **Similarity**: $Cosine(u,v) = \frac{\sum_iu_iv_i}{\sqrt{\sum_iu_i^2}\sqrt{\sum_iv_i^2}}$

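As a small worked illustration of the two formulas, the sketch below applies PPMI and then cosine to a made-up noun-by-verb count matrix. The nouns, verbs and counts are invented, and `toy_ppmi` / `toy_cosine` are illustrative names, not the functions used in the rest of this notebook.

```python
import numpy as np

nouns = ['bread', 'boat', 'idea']
verbs = ['sell', 'eat', 'discuss']
# Invented direct-object counts: rows are nouns, columns are verbs.
toy_matrix = np.array([[3., 3., 0.],
                       [3., 0., 0.],
                       [0., 0., 6.]])

def toy_ppmi(counts):
    """PPMI of a count matrix (assumes no all-zero row or column)."""
    total = counts.sum()
    p_ab = counts / total
    p_a = counts.sum(axis=1, keepdims=True) / total
    p_b = counts.sum(axis=0, keepdims=True) / total
    with np.errstate(divide='ignore'):
        pmi = np.log(p_ab / (p_a * p_b))   # -inf where the count is 0
    return np.maximum(pmi, 0.)             # negative and -inf cells become 0

def toy_cosine(m):
    """Pairwise cosine between the rows of m."""
    unit = m / np.linalg.norm(m, axis=1, keepdims=True)
    return unit.dot(unit.T)

toy_sims = toy_cosine(toy_ppmi(toy_matrix))
print("bread~boat: %.4f" % toy_sims[0, 1])
print("bread~idea: %.4f" % toy_sims[0, 2])
```

With these counts *bread* and *boat* get a small positive similarity because they share the verb *sell*, while *bread* and *idea* share no verb and get zero.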

In [44]:
from __future__ import division # __future__ imports go first.
import numpy as np

In [382]:
def iszero(m): return m==0 # boolean mask of zero entries, used to avoid log(0) and division by 0.

def ppmi(n2v):
    row_sums, col_sums, total_sums = n2v.sum(axis=1), n2v.sum(axis=0), n2v.sum()
    pn, pv, ppmi_m = row_sums/total_sums, col_sums/total_sums, n2v/total_sums
    pn[iszero(pn)] = 1e-10 # NB: this is an expedient, may need to reconsider.
    pv[iszero(pv)] = 1e-10
    ppmi_m /= pn[:,np.newaxis] # * 1/pwi by row.
    ppmi_m /= pv # * 1/pwj by col.
    ppmi_m[iszero(ppmi_m)] = 1e-10 
    ppmi_m = np.log(ppmi_m) # compute pmi.
    ppmi_m = np.maximum(ppmi_m, 0) # compute ppmi.
    return ppmi_m

def cosine(n2v):
    # scale each row to unit length; the dot products are then the cosines.
    n2v_norm = n2v / np.linalg.norm(n2v, axis=1)[:,np.newaxis]
    return np.dot(n2v_norm, n2v_norm.T)

In [377]:
class DistributionalModel:
    
    def __init__(self, nv_counts):
        self.nvc = nv_counts
        print "... building vocabs"
        self._build_vocabs()
        print "... building n2v matrices"
        self._build_n2v_matrices()
        print "... building similarity matrices"
        self._build_similarity_matrices()
        print "READY!"
    
    def _build_vocabs(self):
        self.nsubj_vocab = list({n for n,v in self.nvc['nsubj'].iterkeys()})
        self.dobj_vocab = list({n for n,v in self.nvc['dobj'].iterkeys()})
        self.v_vocab = list({v for n,v in self.nvc['nsubj']}.union({v for n,v in self.nvc['dobj']}))
        self.s2i = {s:i for i,s in enumerate(self.nsubj_vocab)}
        self.o2i = {o:i for i,o in enumerate(self.dobj_vocab)}
        self.v2i = {v:i for i,v in enumerate(self.v_vocab)}
    
    def _build_n2v_matrices(self):
        self.nsubj_n2v = np.zeros((len(self.s2i),len(self.v2i)),dtype=float)
        self.dobj_n2v = np.zeros((len(self.o2i),len(self.v2i)),dtype=float)
        for (n,v),count in self.nvc['nsubj'].iteritems():
            self.nsubj_n2v[self.s2i[n]][self.v2i[v]] = count
        for (n,v),count in self.nvc['dobj'].iteritems():         
            self.dobj_n2v[self.o2i[n]][self.v2i[v]] = count
    
    def _build_similarity_matrices(self):
        self.nsubj_sim_m = cosine(ppmi(self.nsubj_n2v))
        self.dobj_sim_m = cosine(ppmi(self.dobj_n2v))
    
    def nsubj_sim(self, n1, n2):
        try: return self.nsubj_sim_m[self.s2i[n1]][self.s2i[n2]]
        except KeyError: print "Out of Vocabulary Words!"
    
    def dobj_sim(self, n1, n2):
        try: return self.dobj_sim_m[self.o2i[n1]][self.o2i[n2]]
        except KeyError: print "Out of Vocabulary Words!"
    
    def nsubj_top_k(self, n, k=10, reverse=False):
        try: n_idx = self.s2i[n]
        except KeyError: 
            print "Out of Vocabulary Words!"
            return
        if reverse: # k least similar words.
            return map(lambda idx: self.nsubj_vocab[idx],
                       np.argsort(self.nsubj_sim_m[n_idx])[:k])
        # k most similar words; [1:k+1] skips the top hit, expected to be n itself.
        return map(lambda idx: self.nsubj_vocab[idx],
                   np.argsort(self.nsubj_sim_m[n_idx])[::-1][1:k+1])

    def dobj_top_k(self, n, k=10, reverse=False):
        try: n_idx = self.o2i[n]
        except KeyError: 
            print "Out of Vocabulary Words!"
            return
        if reverse: # k least similar words.
            return map(lambda idx: self.dobj_vocab[idx],
                       np.argsort(self.dobj_sim_m[n_idx])[:k])
        # k most similar words; [1:k+1] skips the top hit, expected to be n itself.
        return map(lambda idx: self.dobj_vocab[idx],
                   np.argsort(self.dobj_sim_m[n_idx])[::-1][1:k+1])
    

In [378]:
%%time
sim = DistributionalModel(nv_counts)

... building vocabs
... building n2v matrices
... building similarity matrices
READY!
CPU times: user 16 s, sys: 867 ms, total: 16.8 s
Wall time: 4.86 s


### C. Compute Selectional Preference Strength

**Math**

* **Selectional Preference Strength**: 
    * $S_v(w_0) = \sum_{w\in Seen(v)} SIM(w_0,w) \cdot wt_v(w)$, where $Seen(v)$ is the set of seen nsubj/dobj of verb $v$.

* **Weighting**:
    * Uniform: $wt_v(w) = 1$
    * By Frequency: $wt_v(w) = f(w,v)$
    * By Discriminativity: $wt_v(w) = \log\frac{\text{num. words}}{\text{num. words to whose context } w \text{ belongs}}$

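The sum can be traced by hand on a toy case. The sketch below scores one candidate word against the seen direct objects of one verb under the three weightings listed above. All similarity values, frequencies and counts are invented, and `toy_sps` is an illustrative name. The two figures fed into the discriminativity weight are simply assumed as given here, without committing to how they would be counted from a corpus.

```python
import math

# Invented similarities SIM(w0, w) between a candidate w0 (say 'salad') and
# the seen direct objects w of a verb (say 'eat').
sim_to_seen = {'bread': 0.5, 'apple': 0.2, 'soup': 0.4}
# Invented co-occurrence frequencies f(w, v) for the same seen objects.
freq = {'bread': 3, 'apple': 1, 'soup': 2}
# Invented figures for the discriminativity weight: the total number of words,
# and for each seen object the number of words to whose context it belongs.
num_words = 100
num_contexts = {'bread': 10, 'apple': 50, 'soup': 20}

def toy_sps(sim_to_seen, weight):
    """S_v(w0) = sum over seen w of SIM(w0, w) * wt_v(w)."""
    return sum(s * weight(w) for w, s in sim_to_seen.items())

uniform = toy_sps(sim_to_seen, lambda w: 1)
by_freq = toy_sps(sim_to_seen, lambda w: freq[w])
by_disc = toy_sps(sim_to_seen,
                  lambda w: math.log(float(num_words) / num_contexts[w]))
print("uniform: %.3f | freq: %.3f | disc: %.3f" % (uniform, by_freq, by_disc))
```

Here the uniform score is 0.5 + 0.2 + 0.4 = 1.1 and the frequency-weighted score is 0.5·3 + 0.2·1 + 0.4·2 = 2.5. Under the discriminativity weight, the widely shared *apple* contributes least per unit of similarity.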
##### Weight Computing

In [397]:
n2v = {'s':sim.nsubj_n2v, 'o':sim.dobj_n2v} 
w2i = {'s':sim.s2i, 'o':sim.o2i, 'v':sim.v2i}

In [None]:
log = lambda x: np.log(x) if x>0 else np.log(1e-20)
div = lambda x,y: 0. if y==0 else x/y

In [None]:
def wt(w, v, mode='s', wt_type='uniform'):
    w_idx, v_idx = w2i[mode][w], w2i['v'][v]
    if wt_type=='freq':
        return n2v[mode][w_idx][v_idx]
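    elif wt_type=='disc':
        # NB: this takes log(total count of w / number of distinct verbs w occurs with),
        # which is not the same quantity as the discriminativity formula above; may need to reconsider.
        num_w = sum(n2v[mode][w_idx,:])
        num_context = sum(1 if cell!=0 else 0 for cell in n2v[mode][w_idx,:])
        return log(div(num_w, num_context))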
    else:
        return 1

##### Selectional Preference Strength with Various Weightings

In [448]:
sim_dic = {'s':sim.nsubj_sim, 'o':sim.dobj_sim}

In [449]:
def sps(w, v, mode='s', wt_type='uniform'):
    
    arg_vec = dic[v]['nsubj' if mode=='s' else 'dobj'] # dic is precomputed in section A.
    return sum(sim_dic[mode](w,n) * wt(n,v,mode,wt_type) 
               for n in arg_vec)

In [463]:
words = ['salad','chicken','jury','court']
verb = 'eat'
print "Verb 'eat''s Direct Object SPS\n"
for word in words:
    print "%s-%s\nUniform Wt.: %.6f | Freq. Wt.: %.6f | Disc. Wt.: %.6f \n" % \
          (verb,word,sps(word,verb,mode='o'),
                     sps(word,verb,mode='o',wt_type='freq'),
                     sps(word,verb,mode='o',wt_type='disc'))

Verb 'eat''s Direct Object SPS

eat-salad
Uniform Wt.: 8.286084 | Freq. Wt.: 8.686217 | Disc. Wt.: 0.756887 

eat-chicken
Uniform Wt.: 5.264083 | Freq. Wt.: 6.379131 | Disc. Wt.: 0.643172 

eat-jury
Uniform Wt.: 0.042213 | Freq. Wt.: 0.042213 | Disc. Wt.: 0.031482 

eat-court
Uniform Wt.: 0.078737 | Freq. Wt.: 0.078737 | Disc. Wt.: 0.029188 

