## Model II

* **Norms**: McRae et al. (2005)
* **Model**: Dinu & Lapata (2010) Extension 2
    * Main Idea:
        * The meaning of a target word $t$ is modeled by a distribution over topics $z$.
        * An input "document" to the topic model that produces the distribution of $t$ is generated from $t$'s selected set of contexts $C = \{c_1,...,c_n\}$ and the contexts' frequencies.
        * When the selected contexts are the paradigmatic subjects of $t$, we get a $t$-$z_{subj}$ matrix; when they are the paradigmatic objects of $t$, we get a $t$-$z_{obj}$ matrix.
        * If we can compute $p(prop\mid topic)$, we can obtain a $t$-$prop_{subj}$ matrix and a $t$-$prop_{obj}$ matrix.
        * We can also compute a "general distributional matrix" $t$-$c$ by taking a $[-2,+2]$ word window (with tfidf or PMI conversion).
        * Given a new word $w$, let $t$ be a paradigmatic subject/object of $w$. The similarity of $w$ and $t$ is computed as $SIM(w,t) = \lambda sim_1(w,t) + (1-\lambda) sim_2(w,t)$, where $sim_1$ is computed using the paradigmatic matrix (i.e. $t$-$z_{subj}$ or $t$-$z_{obj}$, depending on whether $w$ and $t$ are paradigmatic subjects or objects), $sim_2$ is computed using the general distributional matrix $t$-$c$, and $\lambda\in[0,1]$ is a hyperparameter.
    * Algorithm:
        * [GOAL] = Find a distribution over properties for an input word $w$.
        * For each $t$ in a set of selected target words, obtain 3 cooccurrence matrices:
            * $t$-$subj$ matrix: $t$'s paradigmatic subject cohort,
            * $t$-$obj$ matrix: $t$'s paradigmatic object cohort,
            * $t$-$c$ matrix: $t$'s $[-2,+2]$ word-window contexts.
        * Compute paradigmatic matrices as follows:
            * Fit a topic model on the DL10 (Dinu & Lapata 2010) "pseudo-documents" built from the $t$-$subj$ and $t$-$obj$ matrices, producing the $t$-$z_{subj}$ and $t$-$z_{obj}$ matrices,
            * Using $p(prop\mid z) = \sum_{t\in topic\,z}p(t\mid z)f(t,prop)$, compute $t$-$prop_{subj}$ matrix and $t$-$prop_{obj}$ matrix.
        * Compute general distributional matrix $t$-$c$ using either tfidf or PMI conversion.
        * Find $w$'s property distribution as follows:
            * Given a sentence where $w$ appears, using the predicate $v$ of the sentence, find $w$'s cohort (either subject/object) $T$,
            * For $t_i\in T$, compute $\lambda sim_1(w,t_i) + (1-\lambda) sim_2(w,t_i)$.
            * $w$'s property distribution will be the average of $SIM(w,t_i),t_i\in T$.
    * Similarity Functions:
        * $Cosine(w,t) = \frac{w\cdot t}{\|w\|\,\|t\|}$.
        * $SKLD(w,t) = \frac{1}{2}\left(\sum_i w(i)\log\frac{w(i)}{t(i)} + \sum_i t(i)\log\frac{t(i)}{w(i)}\right)$, (cf. Griffiths & Steyvers 2004:10(5)).
    * Property Indicator Functions:
        * $f_{binary}(t,prop) = \begin{cases}1 & \text{if } t \text{ has } prop\\ 0 & \text{otherwise}\end{cases}$.
        * $f_{stochastic}(t,prop) = w(t,prop) = \frac{production\_freq(t,prop)}{30}$ (computed from McRae data).
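
The similarity functions and the interpolation $SIM(w,t)$ above can be sketched in plain Python. This is a minimal illustration, not the notebook's implementation: the function names, the `eps` smoothing constant, and the zero-vector convention are assumptions. Note that cosine is a similarity (higher is closer) while SKLD is a divergence (lower is closer), so the two would need to be put on a common scale before being mixed with $\lambda$.

```python
import math

def cosine(w, t):
    """Cosine similarity of two equal-length vectors (higher = more similar)."""
    dot = sum(wi * ti for wi, ti in zip(w, t))
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    norm_t = math.sqrt(sum(ti * ti for ti in t))
    if norm_w == 0 or norm_t == 0:
        return 0.0  # assumed convention for all-zero vectors
    return dot / (norm_w * norm_t)

def skld(w, t, eps=1e-12):
    """Symmetrised KL divergence of two distributions (lower = more similar).

    eps is an assumed smoothing constant that avoids log(0); both inputs
    are renormalised after smoothing.
    """
    w = [x + eps for x in w]
    t = [x + eps for x in t]
    zw, zt = sum(w), sum(t)
    w = [x / zw for x in w]
    t = [x / zt for x in t]
    kl_wt = sum(wi * math.log(wi / ti) for wi, ti in zip(w, t))
    kl_tw = sum(ti * math.log(ti / wi) for wi, ti in zip(w, t))
    return 0.5 * (kl_wt + kl_tw)

def combined_sim(sim1, sim2, lam=0.5):
    """SIM = lam * sim1 + (1 - lam) * sim2, with lam in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * sim1 + (1.0 - lam) * sim2
```

Here `sim1` would come from the paradigmatic ($t$-$z$) vectors and `sim2` from the general distributional ($t$-$c$) vectors.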

## 0. Norms

### A. Load Norms

In [2]:
import pandas as pd

In [3]:
data_path = "/Users/jacobsw/Desktop/CODER/IMPLEMENTATION_CAMP/BASIC_TOPICS/DISTRIBUTIONAL_SEMANTICS/DATA/McRae-BRM-InPress/"

In [4]:
df = pd.read_csv(data_path+"CONCS_FEATS_concstats_brm.xls", delimiter='\t')

In [5]:
print df.columns

Index([u'Concept', u'Feature', u'WB_Label', u'WB_Maj', u'WB_Min', u'BR_Label',
       u'Prod_Freq', u'Rank_PF', u'Sum_PF_No_Tax', u'CPF', u'Disting',
       u'Distinct', u'CV_No_Tax', u'Intercorr_Str_Tax',
       u'Intercorr_Str_No_Tax', u'Feat_Length_Including_Spaces', u'Phon_1st',
       u'KF', u'ln(KF)', u'BNC', u'ln(BNC)', u'Familiarity', u'Length_Letters',
       u'Length_Phonemes', u'Length_Syllables', u'Bigram', u'Trigram',
       u'ColtheartN', u'Num_Feats_Tax', u'Num_Feats_No_Tax',
       u'Num_Disting_Feats_No_Tax', u'Disting_Feats_%_No_Tax',
       u'Mean_Distinct_No_Tax', u'Mean_CV_No_Tax', u'Density_No_Tax',
       u'Num_Corred_Pairs_No_Tax', u'%_Corred_Pairs_No_Tax', u'Num_Func',
       u'Num_Vis_Mot', u'Num_VisF&S', u'Num_Vis_Col', u'Num_Sound',
       u'Num_Taste', u'Num_Smell', u'Num_Tact', u'Num_Ency', u'Num_Tax'],
      dtype='object')


In [7]:
df.head(20)

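| | Concept | Feature | WB_Label | WB_Maj | WB_Min | BR_Label | Prod_Freq | Rank_PF | Sum_PF_No_Tax | CPF | ... | Num_Func | Num_Vis_Mot | Num_VisF&S | Num_Vis_Col | Num_Sound | Num_Taste | Num_Smell | Num_Tact | Num_Ency | Num_Tax |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | accordion | a_musical_instrument | superordinate | c | h | taxonomic | 28 | 1 | | 18 | ... | 2 | 0 | 2 | 0 | 2 | 0 | 0 | 0 | 2 | 1 |
| 1 | accordion | associated_with_polkas | associated_entity | s | e | encyclopaedic | 9 | 4 | 9.0 | 1 | ... | 2 | 0 | 2 | 0 | 2 | 0 | 0 | 0 | 2 | 1 |
| 2 | accordion | has_buttons | external_component | e | ce | visual-form_and_surface | 8 | 5 | 163.0 | 13 | ... | 2 | 0 | 2 | 0 | 2 | 0 | 0 | 0 | 2 | 1 |
| 3 | accordion | has_keys | external_component | e | ce | visual-form_and_surface | 17 | 2 | 108.0 | 7 | ... | 2 | 0 | 2 | 0 | 2 | 0 | 0 | 0 | 2 | 1 |
| 4 | accordion | inbeh_-_produces_music | entity_behavior | e | b | sound | 6 | 7 | 178.0 | 13 | ... | 2 | 0 | 2 | 0 | 2 | 0 | 0 | 0 | 2 | 1 |
| 5 | accordion | is_loud | external_surface_property | e | se | sound | 6 | 7 | 317.0 | 34 | ... | 2 | 0 | 2 | 0 | 2 | 0 | 0 | 0 | 2 | 1 |
| 6 | accordion | requires_air | contingency | i | c | encyclopaedic | 11 | 3 | 49.0 | 4 | ... | 2 | 0 | 2 | 0 | 2 | 0 | 0 | 0 | 2 | 1 |
| 7 | accordion | used_by_moving_bellows | action | s | a | function | 8 | 5 | 8.0 | 1 | ... | 2 | 0 | 2 | 0 | 2 | 0 | 0 | 0 | 2 | 1 |
| 8 | accordion | worn_on_chest | function | s | f | function | 6 | 7 | 6.0 | 1 | ... | 2 | 0 | 2 | 0 | 2 | 0 | 0 | 0 | 2 | 1 |
| 9 | airplane | beh_-_flies | entity_behavior | e | b | visual-motion | 25 | 1 | 712.0 | 46 | ... | 3 | 3 | 5 | 0 | 0 | 0 | 0 | 0 | 2 | 0 |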


### B. Build Token-Lemma Lookup

In [11]:
from nltk.corpus import brown
from spacy.en import English

In [12]:
parser = English()

In [13]:
brown_sents = [unicode(' '.join(sent)) for sent in brown.sents()]

In [14]:
%%time
parsed_sents = [parser(sent) for sent in brown_sents]

CPU times: user 1min 36s, sys: 661 ms, total: 1min 37s
Wall time: 1min 37s


In [15]:
def make_token2lemma_dict(parsed_sents):
    
    lemmas = set()
    token2lemma = {}
    for parsed_sent in parsed_sents:
        for token in parsed_sent:
            token2lemma[token.orth_] = token.lemma_
            lemmas.add(token.lemma_)
    
    return lemmas, token2lemma

In [16]:
%%time
brown_lemmas, brown_t2l = make_token2lemma_dict(parsed_sents)

CPU times: user 1.22 s, sys: 32.9 ms, total: 1.26 s
Wall time: 1.25 s


In [25]:
print brown_t2l['books']

book


### C. Build Norm-Feature Lookup

In [55]:
norms = set(df['Concept'])
features = set(df['Feature'])

In [46]:
def norm_normalize(norm):
    # Strip disambiguation suffixes such as 'cat_(kitchen)' -> 'cat',
    # then map the word to its Brown lemma when one is known.
    norm = norm.split('_')[0]
    return brown_t2l.get(norm, norm)


In [47]:
print norm_normalize('cat_(kitchen)')
print norm_normalize('cat')

cat
cat


In [43]:
# Count Out-Of-Vocab Norms For Brown
t = [] 
for norm in norms:
    norm = norm.split('_')[0] if '_' in norm else norm
    if norm in brown_lemmas or norm in brown_t2l: continue
    t.append(norm)

In [45]:
print t
print len(t)

['earmuffs', 'bike', 'screwdriver', 'unicycle', 'camisole', 'crossbow', 'hamster', 'bra', 'sledgehammer', 'skateboard', 'leotards', 'rhubarb', 'platypus', 'pelican', 'minnow', 'canary', 'spatula', 'motorcycle', 'iguana', 'chickadee', 'giraffe', 'tricycle', 'bazooka', 'tomahawk', 'ostrich', 'cucumber', 'lettuce', 'whale', 'stork', 'bluejay', 'colander', 'chipmunk', 'escalator', 'partridge', 'parka', 'zucchini', 'dunebuggy', 'machete', 'crowbar', 'housefly', 'blender', 'nectarine', 'scooter', 'cougar', 'penguin', 'emu', 'honeydew', 'wheelbarrow', 'harmonica', 'eggplant', 'groundhog', 'harpoon', 'yam', 'squid', 'toaster', 'moose', 'tuna', 'surfboard', 'nylons', 'raven', 'budgie', 'fridge', 'gopher', 'flamingo', 'sleigh', 'trombone', 'strainer', 'dagger', 'chimp', 'buzzard', 'guppy', 'grater', 'nightgown', 'cello', 'hornet', 'finch', 'tangerine', 'gorilla', 'caribou']
79


In [48]:
from collections import defaultdict

In [49]:
def make_norm2feature_dict(df):
    
    norm2feature = defaultdict(int)
    for i in xrange(df.shape[0]):
        norm = norm_normalize(df.ix[i]['Concept'])
        prop = df.ix[i]['Feature']
        norm2feature[(norm,prop)] = df.ix[i]['Prod_Freq'] # production frequency.
    
    return norm2feature

In [50]:
print df.ix[0]['Concept']
print df.ix[0]['Feature']
print df.ix[0]['Prod_Freq']

accordion
a_musical_instrument
28


In [51]:
%%time
norm2feature = make_norm2feature_dict(df)

CPU times: user 4.13 s, sys: 42.8 ms, total: 4.18 s
Wall time: 4.18 s


In [60]:
for prop in features:
    count = norm2feature.get(('airplane',prop), 0) # .get avoids inserting zero entries into the defaultdict.
    if count!=0: print prop, count

used_for_transportation 10
is_fast 11
used_for_travel 7
has_a_propeller 5
has_wings 20
beh_-_flies 25
is_large 8
requires_pilots 11
has_engines 5
used_for_passengers 15
found_in_airports 8
made_of_metal 8
inbeh_-_crashes 7


## I. Compute Cooccurrence Matrices

### A. Build Cooccurrence Matrix I: General Distributional Matrix $t$-$c$

In [73]:
from collections import Counter
from string import punctuation as punc
from spacy.en import STOPWORDS

In [63]:
brown_fdist = Counter(brown.words())

In [70]:
norms = {norm_normalize(norm) for norm in norms} # normalized norms; a set gives fast membership tests.

In [77]:
stopwords = STOPWORDS # alias, not a copy: the additions below also modify STOPWORDS, which tokensent2lemmasent uses.
stopwords.add('``'); stopwords.add("''")

In [78]:
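def tokensent2lemmasent(parsed_sents, fdist, norms, freq=20):
    # Convert parsed sentences to lists of lemmas. Norm words are always kept;
    # other tokens are dropped if rare (count < freq), a stopword, or punctuation.
    sents_in_lemmas = []
    for parsed_sent in parsed_sents:
        sent = []
        for token in parsed_sent:
            if token.orth_ in norms or token.lemma_ in norms:
                sent.append(token.lemma_)
            elif fdist[token.orth_] < freq \
                or token.lemma_ in STOPWORDS \
                or token.lemma_ in punc: continue # substring test on a string: catches single marks, but not e.g. '--'.
            else: 
                sent.append(token.lemma_)
        sents_in_lemmas.append(sent)
    
    return sents_in_lemmas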

In [79]:
%%time
brown_lemmasents = tokensent2lemmasent(parsed_sents, brown_fdist, norms)

CPU times: user 1.84 s, sys: 36.8 ms, total: 1.88 s
Wall time: 1.87 s


In [83]:
vocab = {word for sent in brown_lemmasents for word in sent}

In [84]:
len(vocab)

3988

In [87]:
import numpy as np

In [88]:
def build_cooccurrence_matrix(sents, vocab, win_size):
    
    w2i = {w:i for i,w in enumerate(vocab)}
    i2w = {i:w for i,w in enumerate(vocab)}
    print "... building dictionary"
    cooccurrence_dict = defaultdict(int)
    for sent in sents:
        for i,target in enumerate(sent):
            # up to win_size words on each side of the target; slices clamp at the sentence edges.
            contexts = sent[max(0,i-win_size):i] + sent[i+1:i+1+win_size]
            for context in contexts:
                cooccurrence_dict[(target,context)] += 1
    print "... building cooccurrence matrix"
    cooccurrence_matrix = np.zeros((len(vocab),len(vocab)))
    # Fill only the observed pairs: looping over vocab x vocab would insert a
    # zero entry into the defaultdict for every unseen pair.
    for (target,context),count in cooccurrence_dict.iteritems():
        if target in w2i and context in w2i:
            cooccurrence_matrix[w2i[target]][w2i[context]] = count
    
    return w2i, i2w, cooccurrence_matrix

In [94]:
%%time
w2i, i2w, t2c_mat = build_cooccurrence_matrix(brown_lemmasents, vocab, 2)

... building dictionary
... building cooccurrence matrix
CPU times: user 26.2 s, sys: 1.47 s, total: 27.7 s
Wall time: 27.8 s


In [95]:
w2i['cat']

1190

In [97]:
t2c_mat[1190].sum()

136.0
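
The overview calls for a tfidf or PMI conversion of the raw $t$-$c$ counts. A minimal sketch of the PMI variant follows, working on sparse pair counts of the form `{(target, context): count}` rather than on the dense matrix. The function name and the `positive` flag (which clips negative values to zero, i.e. PPMI) are assumptions for illustration, and the toy counts are made up.

```python
import math
from collections import defaultdict

def pmi_from_counts(pair_counts, positive=True):
    """PMI(t, c) = log( n(t,c) * N / (n(t,.) * n(.,c)) ) for every observed pair.

    pair_counts: dict (target, context) -> cooccurrence count.
    positive:    if True, negative PMI values are clipped to 0 (PPMI).
    """
    total = float(sum(pair_counts.values()))
    row_sums = defaultdict(float)  # n(t, .)
    col_sums = defaultdict(float)  # n(., c)
    for (target, context), n in pair_counts.items():
        row_sums[target] += n
        col_sums[context] += n
    pmi = {}
    for (target, context), n in pair_counts.items():
        if n == 0:
            continue  # PMI is undefined for unseen pairs; leave them out
        value = math.log(n * total / (row_sums[target] * col_sums[context]))
        pmi[(target, context)] = max(value, 0.0) if positive else value
    return pmi

# Made-up toy counts, for illustration only.
toy_counts = {('cat', 'purr'): 2, ('cat', 'the'): 2, ('dog', 'the'): 4}
toy_ppmi = pmi_from_counts(toy_counts)
```

In the toy data, `purr` only occurs with `cat`, so that pair gets a positive score, while the frequent context `the` is discounted.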

### B. Build Cooccurrence Matrix II: Paradigmatic Matrix $t$-$subj$ & $t$-$obj$

In [99]:
def extract_dep_triples(parsed_sents):
    
    triples = []
    for parsed_sent in parsed_sents:
        for token in parsed_sent:
            lemma_triple = (token.lemma_, token.dep_, token.head.lemma_)
            triples.append(lemma_triple)
    
    return triples

In [100]:
%%time
dep_triples = extract_dep_triples(parsed_sents)

CPU times: user 932 ms, sys: 102 ms, total: 1.03 s
Wall time: 1.03 s


In [125]:
# get subj/obj vocabs
def get_arg_set(triples, argtype): # argtype in {'subj', 'obj'}; suffix match, so 'obj' also matches e.g. 'pobj' (objects of prepositions).
    
    arg_set = set()
    for triple in triples:
        if triple[1].endswith(argtype):
            arg_set.add(triple[0])
    
    return arg_set


In [104]:
%%time
subj_set, obj_set = get_arg_set(dep_triples, 'subj'), get_arg_set(dep_triples, 'obj')

CPU times: user 906 ms, sys: 9.79 ms, total: 915 ms
Wall time: 913 ms


In [110]:
print len(subj_set)
print len(obj_set)

8729
16274


In [126]:
# get N:[v1...vn] dict
def get_pred_list(triples, argtype): # argtype in {'subj', 'obj'}; the "predicate" is the token's head, so 'pobj' tokens contribute prepositions.
    
    pred_list = defaultdict(list)
    for triple in triples:
        if triple[1].endswith(argtype):
            pred_list[triple[0]].append(triple[2])
    
    return pred_list


In [127]:
%%time
subj2pred, obj2pred = get_pred_list(dep_triples, 'subj'), get_pred_list(dep_triples, 'obj')

CPU times: user 970 ms, sys: 365 ms, total: 1.34 s
Wall time: 1.3 s


In [135]:
print subj2pred['cat']
print obj2pred['cat']

[u'be', u'come', u'out', u'be', u'complement', u'be', u'go', u'meet', u'manage']
[u'for', u'like', u'feed', u'have', u'put', u'as', u'like', u'like', u'like', u'to', u'at', u'like', u'like', u'like', u'watch', u'dye', u'of', u'send']


In [155]:
def build_para_matrix(pred_list, vocab): # pred_list: dict; vocab: list
    
    w2i = {w:i for i,w in enumerate(vocab)}
    i2w = {i:w for i,w in enumerate(vocab)}
    para_matrix = np.zeros((len(vocab),len(vocab)))
    for i,wi in enumerate(vocab):
        if i!=0 and i%1000==0: print "... processed %d words." % i
        for wj in vocab:
            if wi==wj: continue
            for pred in pred_list[wi]: # if wi occurs with predicate v 3 times and wj occurs with v 2 times, v adds 3*2 = 6 to the (wi,wj) cell: the "paradigmatic weight".
                para_matrix[w2i[wi]][w2i[wj]] += pred_list[wj].count(pred)
    
    return w2i, i2w, para_matrix
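
The nested loop above calls `list.count` for every word pair, which is what makes it slow. As a hedged alternative, the same paradigmatic weights can be accumulated per predicate with `Counter`. The sketch below is not the notebook's code: `paradigmatic_weights` is a hypothetical name, it returns a sparse dict of nonzero cells instead of a dense matrix, and the toy predicate lists are made up.

```python
from collections import Counter, defaultdict

def paradigmatic_weights(pred_list):
    """Sparse paradigmatic weights.

    pred_list: dict word -> list of predicates (repeats allowed).
    Returns dict (wi, wj) -> sum over shared predicates of
    count_i(pred) * count_j(pred), for wi != wj.
    """
    # Index words by predicate: pred -> [(word, count of pred for that word)].
    by_pred = defaultdict(list)
    for word, preds in pred_list.items():
        for pred, n in Counter(preds).items():
            by_pred[pred].append((word, n))
    weights = defaultdict(int)
    # Only words that share a predicate are ever paired.
    for entries in by_pred.values():
        for wi, ni in entries:
            for wj, nj in entries:
                if wi != wj:
                    weights[(wi, wj)] += ni * nj
    return dict(weights)

# Made-up toy predicate lists, for illustration only.
toy_preds = {
    'cat': ['be', 'be', 'eat'],
    'dog': ['be', 'eat', 'eat', 'bark'],
    'car': ['be'],
}
toy_weights = paradigmatic_weights(toy_preds)
```

A dense matrix can then be filled from the returned dict using a `w2i` index. Very frequent predicates still produce many pairs, so this is faster mainly when most word pairs share no predicate.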
    

In [156]:
%%time
w2i_subj, i2w_subj, subj_para_mat = build_para_matrix(subj2pred, list(subj_set)) # expected: 8~9 mins; took ~16 mins in the run below

... processed 1000 words.
... processed 2000 words.
... processed 3000 words.
... processed 4000 words.
... processed 5000 words.
... processed 6000 words.
... processed 7000 words.
... processed 8000 words.
CPU times: user 15min 41s, sys: 3.99 s, total: 15min 45s
Wall time: 15min 46s


In [158]:
%%time
w2i_obj, i2w_obj, obj_para_mat = build_para_matrix(obj2pred, list(obj_set)) # expected: 16~17 mins

In [160]:
import cPickle

In [162]:
path = "/Users/jacobsw/Desktop/UNIV/FALL_2016/LIN389C_RSCH_COMPLING/BAYESIAN/CODE_DRAFTS/DATA/"
with open(path+"subj_para_mat.p",'wb') as f:
    cPickle.dump((w2i_subj,i2w_subj,subj_para_mat), f)
# cPickle.dump((w2i_obj,i2w_obj,obj_para_mat), open(path+"obj_para_mat.p",'wb'))
# w2i_subj,i2w_subj,subj_para_mat = cPickle.load(open(path+"subj_para_mat.p",'rb'))
# w2i_obj,i2w_obj,obj_para_mat = cPickle.load(open(path+"obj_para_mat.p",'rb'))

## III. Topic Modeling
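
The overview describes the topic model's input as a "pseudo-document" generated from a target word's selected contexts and their frequencies. A minimal sketch of that construction is given below, assuming each context is simply repeated according to its count. The function name and the nested-dict input are hypothetical (a row of a paradigmatic matrix could be converted to this form), and the topic model itself is not shown.

```python
def make_pseudo_documents(context_counts):
    """Build one pseudo-document per target word.

    context_counts: dict target -> dict context -> count.
    Returns dict target -> list of context tokens, where each context
    appears `count` times (contexts are sorted for reproducibility).
    """
    docs = {}
    for target, counts in context_counts.items():
        doc = []
        for context, n in sorted(counts.items()):
            doc.extend([context] * int(n))
        docs[target] = doc
    return docs

# Made-up toy counts, for illustration only.
toy_docs = make_pseudo_documents({'cat': {'dog': 2, 'bird': 1}, 'car': {}})
```

Targets with no contexts yield empty documents and would presumably be dropped before fitting.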

## IV. Compute Similarity