# A general application of Bayesian & BERT-based methods to Metaphor Source Domain Inference

This is a linking story. I wanted to show that a metaphor's source domain comes from . . . somewhere. Specifically, I wanted to show that low-level linguistic phenomena alone can account for what we call metaphor. Much research on conceptual metaphor discusses at length how people make sense of metaphor without first describing how the linguistic form of an utterance helps transmit that metaphoric content. We start by creating a contextual representation of a target term $w'$ in a metaphoric utterance $x$:

$$ E_{x,w'} = \mathrm{BERT}(x)\, \delta_{w=w'} $$
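
To make the notation concrete, here's a minimal sketch of what the $\delta_{w=w'}$ term does: it keeps only the row of the contextual output that belongs to the target term. The random matrix `hidden` is a stand-in for real BERT output, and the toy sentence, the 8-dimensional vectors, and all variable names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

tokens = ["the", "government", "is", "a", "machine"]  # toy utterance x (made up)
target = "government"                                  # target term w'

# Stand-in for BERT(x): one contextual vector per token.
# (Assumption: real representations would come from an actual BERT model.)
hidden = rng.normal(size=(len(tokens), 8))

# delta_{w = w'}: 1 where the token is the target term, 0 everywhere else.
delta = np.array([tok == target for tok in tokens], dtype=float)

# Zero out every row except the target's, then collapse to a single vector.
E = (hidden * delta[:, None]).sum(axis=0)
```

Here `E` ends up equal to `hidden[1]`, the row for "government".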

I already did this step before this write-up, so we'll just import those representations here.

In [None]:
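from GovMetaphors.mod.sim_matrix import *  # expected to supply torch, nn, np, sel, probFn
soft = nn.Softmax(dim=-1)
m = torch.load('GovMetaphors/corpora/GovTweets.pt')

# Each entry of CM['vec'] is an embedding stored as a string like '[0.1, 0.2, ...]';
# parse them back into one float tensor with a row per example.
# (np.float was removed in NumPy 1.24, so we use the builtin float.)
CM = m['CM-full']
cV = torch.FloatTensor([[float(n) for n in vec.replace('[','').replace(']','').split(', ')] for vec in CM['vec'].values])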

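Now, we get to the magic. We need to find the probability that an embedding $E_{x,w'}$ could come from some other context associated with an embedding $E_{m,w'_m}$. We do this by treating the cosine error between the two embeddings as a draw from a zero-mean normal distribution, truncated to non-negative values, with scale $\sigma$: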

$$ P(E_{x,w'} \mid E_{m,w'_m}) = P_{\mathcal{N}_{[0,\infty)}} \left( \mathrm{cosineError}(E_{x,w'},E_{m,w'_m}) \,\bigg|\, \mu=0, \sigma \right) $$
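
For intuition, here's a rough sketch of one way to compute something of this shape. It is not the `probFn` implementation, which isn't shown here. I'm assuming that cosine error means one minus cosine similarity, and I'm using the Gaussian kernel $\exp(-e^2 / 2\sigma^2)$, which is the half-normal density without its normalizing constant, so that an error of 0 maps to a value of 1. The function names and toy vectors are made up.

```python
import numpy as np

def cosine_error(a, b):
    """Pairwise 1 - cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a @ b.T

def half_normal_kernel(err, sigma):
    """exp(-err^2 / (2 sigma^2)): the half-normal density up to a constant factor."""
    return np.exp(-0.5 * (err / sigma) ** 2)

# Three toy "embeddings": the second is close to the first, the third is orthogonal to it.
vecs = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
P = half_normal_kernel(cosine_error(vecs, vecs), sigma=0.3)
```

Under these assumptions every example scores about 1 against itself, and the score falls off quickly as the cosine error grows past $\sigma$.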

In [None]:
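# sigma = .3 sets how quickly the probability falls off as cosine error grows.
pi = probFn(.3)
# Pairwise probabilities between all examples (one row per example).
P = pi.PROB(cV,cV)
r = P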

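So far, each sentence is treated independently. To get metaphor source domain categories, we pool the sentences that share a category $M$. How? Well... here $P(M)$ is the share of examples tagged with $M$, and $k_M$ is a normalizing constant: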

$$ P(M|E_{x,w'}) = \frac{1}{k_M} P(M) \sum_m P(E_{x,w'}|E_{m,w'_m}) \delta_{m \in M} $$
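
Here's a small sketch of that aggregation on toy numbers, assuming a precomputed pairwise probability matrix. The function name, the labels, and the values in `P` are invented for illustration; the real notebook uses the `sel` helper for the masks, which I'm assuming behaves like a membership test, so a plain `==` comparison stands in for it here.

```python
import numpy as np

def category_scores(P, labels, categories):
    """Score every example against every category, leaving the example itself out."""
    labels = np.asarray(labels)
    P = np.array(P, dtype=float)       # copy, so the caller's matrix is untouched
    np.fill_diagonal(P, 0.0)           # drop each example's comparison with itself
    out = np.zeros((len(labels), len(categories)))
    for j, cat in enumerate(categories):
        member = labels == cat         # delta_{m in M}
        prior = member.mean()          # P(M): share of examples tagged M
        out[:, j] = prior * P[:, member].sum(axis=1)
    return out / out.sum(axis=0, keepdims=True)  # 1 / k_M: each column sums to 1

labels = ["WAR", "WAR", "MACHINE", "MACHINE"]
P = [[1.0, 0.9, 0.1, 0.2],
     [0.9, 1.0, 0.2, 0.1],
     [0.1, 0.2, 1.0, 0.8],
     [0.2, 0.1, 0.8, 1.0]]
scores = category_scores(P, labels, ["WAR", "MACHINE"])
```

In this toy case the two WAR rows get the larger share of the WAR column and the two MACHINE rows get the larger share of the MACHINE column.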

In [None]:
# Sentence indices . . . we use these to exclude the example
# given by that row . . . because it'll have a probability
# close to 1, and because it's kinda irrelevant for
# comparison.
cidx = np.arange(len(CM))

# Heads up! Not every CM tag in our dataset has 2 or more examples.
# We thus restrict our data to only those CM tags with 2 or
# more examples in 'em.
good_values = np.array([k for k,v in CM['CM'].value_counts().items() if v > 1])

# Implements the equation listed in the last markdown cell:
# sum the probabilities over the other examples in each category . . .
OUT = torch.FloatTensor([[row[sel([m_],CM['CM'].values) & ~sel([i],cidx)].sum().item() for m_ in good_values] for i,row in enumerate(r)])
# . . . weight by the prior P(M), the share of examples tagged with that category . . .
OUT = OUT * torch.FloatTensor([(sel([m_],CM['CM'].values).sum()/len(sel([m_],CM['CM'].values))) for m_ in good_values]).unsqueeze(0)
print(OUT.shape)
# . . . and normalize each category's column to sum to 1.
OUT = OUT/(OUT.sum(dim=0))

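And finally, we validate (1) precision (.84 with $\sigma=.3$) and (2) recall (.85 with $\sigma=.3$). Precision is the share of CM tags whose own examples score higher on that tag than on the other tags; recall is the share of CM tags whose own examples score higher on that tag than everyone else's examples do.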
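
To spell those two checks out, here's a compact sketch on a toy score matrix. The function name and the numbers are made up, and the label comparison with `==` stands in for the notebook's `sel` helper, which I'm assuming is a membership mask.

```python
import numpy as np

def precision_recall(out, labels, categories):
    """Share of categories that pass the precision check and the recall check."""
    out = np.asarray(out, dtype=float)
    labels, categories = np.asarray(labels), np.asarray(categories)
    prec, rec = [], []
    for cat in categories:
        rows, col = labels == cat, categories == cat
        own = out[rows][:, col].mean()                   # own examples, own category
        prec.append(own > out[rows][:, ~col].mean())     # vs. own examples, other categories
        rec.append(own > out[~rows][:, col].mean())      # vs. other examples, own category
    return float(np.mean(prec)), float(np.mean(rec))

out = [[0.4, 0.1],
       [0.4, 0.1],
       [0.1, 0.4],
       [0.1, 0.4]]
labels = ["WAR", "WAR", "MACHINE", "MACHINE"]
precision, recall = precision_recall(out, labels, ["WAR", "MACHINE"])
```

Both categories pass both checks in this toy matrix, so `precision` and `recall` are both 1.0.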

In [None]:
#(1) Precision
PRECISION = np.array([OUT[sel([cm],CM['CM'].values)][:,sel([cm],np.array(good_values))].mean() > OUT[sel([cm],CM['CM'].values)][:,~sel([cm],np.array(good_values))].mean() for cm in good_values]).mean()

#(2) Recall
RECALL = np.array([OUT[sel([cm],CM['CM'].values)][:,sel([cm],np.array(good_values))].mean() > OUT[~sel([cm],CM['CM'].values)][:,sel([cm],np.array(good_values))].mean() for cm in good_values]).mean()

and cluster fit, the mean ratio of each category's in-category score to its out-of-category score (8.08 with $\sigma=.3$).
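
As a quick sketch of that ratio on toy numbers (the function name and values are made up, and `==` again stands in for `sel`, which I'm assuming is a membership mask):

```python
import numpy as np

def cluster_fit(out, labels, categories):
    """Mean over categories of (in-category score) / (out-of-category score)."""
    out = np.asarray(out, dtype=float)
    labels, categories = np.asarray(labels), np.asarray(categories)
    ratios = []
    for cat in categories:
        rows, col = labels == cat, categories == cat
        inside = out[rows][:, col].mean()     # own examples scored on their own category
        outside = out[rows][:, ~col].mean()   # the same examples scored on other categories
        ratios.append(inside / outside)
    return float(np.mean(ratios))

out = [[0.4, 0.1],
       [0.4, 0.1],
       [0.1, 0.4],
       [0.1, 0.4]]
fit = cluster_fit(out, ["WAR", "WAR", "MACHINE", "MACHINE"], ["WAR", "MACHINE"])
```

A value above 1 means examples score higher, on average, on their own category than on the others; here each category's ratio is 0.4 / 0.1.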

In [None]:
np.array([OUT[sel([cm],CM['CM'].values)][:,sel([cm],np.array(good_values))].mean() / OUT[sel([cm],CM['CM'].values)][:,~sel([cm],np.array(good_values))].mean() for cm in good_values]).mean()

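Permutation test procedures. Each one reshuffles the labels 1000 times and reports the share of shuffles that score at least as high as the observed precision and recall.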
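
The general recipe can be sketched as follows, with a deliberately simple statistic standing in for precision and recall. The function names, the toy scores, and the integer labels are all assumptions for illustration.

```python
import numpy as np

def own_category_mean(out, labels):
    """Toy statistic: mean score each example gets on the column of its own label."""
    return float(out[np.arange(len(labels)), labels].mean())

def permutation_p_value(statistic, out, labels, n_perm=1000, seed=0):
    """Share of label shuffles whose statistic is at least the observed one."""
    rng = np.random.default_rng(seed)
    observed = statistic(out, labels)
    null = np.array([statistic(out, rng.permutation(labels)) for _ in range(n_perm)])
    return float((null >= observed).mean())

# Eight examples, two categories (columns 0 and 1), perfectly separated scores.
out = np.array([[0.2, 0.05]] * 4 + [[0.05, 0.2]] * 4)
labels = np.array([0] * 4 + [1] * 4)
p = permutation_p_value(own_category_mean, out, labels)
```

With only eight examples a shuffle occasionally reproduces the original labeling, so `p` is small but not necessarily zero. A common variant adds one to both the count and the number of shuffles so that the estimate can never be exactly 0.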

In [None]:
# (1) Randomizing examples (P = 0.0)
prec, rec = [],[]
for ep in range(1000):
    # print('epoch {}/{}'.format(ep+1,1000))
    # CMS = np.random.choice(good_values,size=(len(good_values),),replace=False)
    ROWS = np.random.choice(CM['CM'].values,size=(len(CM['CM'].values),),replace=False)
    prec.append( np.array([OUT[sel([cm],ROWS)][:,sel([cm],good_values)].mean() > OUT[sel([cm],ROWS)][:,~sel([cm],good_values)].mean() for cm in good_values]).mean() )
    rec.append( np.array([OUT[sel([cm],ROWS)][:,sel([cm],good_values)].mean() > OUT[~sel([cm],ROWS)][:,sel([cm],good_values)].mean() for cm in good_values]).mean() )
prec,rec = np.array(prec), np.array(rec)
(PRECISION <= prec).mean(),(RECALL <= rec).mean()



# (2) Randomizing CM Source categories (P = 0.0)
prec, rec = [],[]
for ep in range(1000):
    # print('epoch {}/{}'.format(ep+1,1000))
    CMS = np.random.choice(good_values,size=(len(good_values),),replace=False)
    # ROWS = np.random.choice(CM['CM'].values,size=(len(CM['CM'].values),),replace=False)
    prec.append( np.array([OUT[sel([cm],CM['CM'].values)][:,sel([cm],CMS)].mean() > OUT[sel([cm],CM['CM'].values)][:,~sel([cm],CMS)].mean() for cm in CMS]).mean() )
    rec.append( np.array([OUT[sel([cm],CM['CM'].values)][:,sel([cm],CMS)].mean() > OUT[~sel([cm],CM['CM'].values)][:,sel([cm],CMS)].mean() for cm in CMS]).mean() )
prec,rec = np.array(prec), np.array(rec)
(PRECISION <= prec).mean(),(RECALL <= rec).mean()



# (3) Randomizing both (P = 0.0)
prec, rec = [],[]
for ep in range(1000):
    # print('epoch {}/{}'.format(ep+1,1000))
    CMS = np.random.choice(good_values,size=(len(good_values),),replace=False)
    ROWS = np.random.choice(CM['CM'].values,size=(len(CM['CM'].values),),replace=False)
    prec.append( np.array([OUT[sel([cm],ROWS)][:,sel([cm],CMS)].mean() > OUT[sel([cm],ROWS)][:,~sel([cm],CMS)].mean() for cm in CMS]).mean() )
    rec.append( np.array([OUT[sel([cm],ROWS)][:,sel([cm],CMS)].mean() > OUT[~sel([cm],ROWS)][:,sel([cm],CMS)].mean() for cm in CMS]).mean() )
prec,rec = np.array(prec), np.array(rec)
(PRECISION <= prec).mean(),(RECALL <= rec).mean()