# Building Sense Embeddings

The sense embeddings of a certain sense is calculated by averaging the context embeddings of all context in which certain sense exists. There exists several different methods for combining words embeddings to form context embeddings. Our starting poing is applying plain average (bag of word). 

Reference: Iaacobaci et al, Embeddings for Word Sense Disambiguation: An Evaluation Study
http://aclweb.org/anthology/P/P16/P16-1085.pdf

In [2]:
# Import neccesary libraries
import nltk
from nltk.corpus import wordnet as wn
from nltk.corpus import semcor
import numpy as np
import collections
import os
import pickle
import dill

In [3]:
#Load an example embeddings
embedding_dict = pickle.load(open('glove_50d_50kvoc.pk','rb'))
example_sentence = semcor.sents()[0]

In [4]:
example_chunk = semcor.tagged_sents(tag='sem')[0]

In [5]:
example_sentence_list = semcor.tagged_sents(tag='sem')[:10]

In [6]:
# Build a function to combine word to form context embedding:
def getContextEmb(sentence,center,window_size,embedding_dict,emb_size):
    # Input introductions
    # sentence: an array of tokens of untagged sentence. 
    # center: position of the center word
    # window_size: size of context window
    # embedding_Dict: embedding dictionary used to calculate context
    ################################################################
    start_pos = max([0,center-window_size])
    end_pos = min([len(sentence),(center+window_size)+1])
    context_tokens = sentence[start_pos:end_pos]
    output_embedding = np.zeros(emb_size)
    for word in context_tokens:
        try:
            output_embedding+=embedding_dict[word]
        except:
            output_embedding+=np.random.uniform(1,-1,emb_size)
    return output_embedding

Trying to create a method to form a dictionary of sense embeddings.

In [7]:
def buildSemEmb(tagged_sents,emb_size,embedding_dict,context_builder = getContextEmb):
    output_dict = collections.defaultdict(lambda: np.zeros(emb_size))
    count_dict = collections.defaultdict(lambda: 0)
    for sentence in tagged_sents:
        #print(sentence)
        for idx,chunk in enumerate(sentence):
            if(type(chunk))==list:
                continue
            else:
                #Use try except handling since some of the label is broken
                try:
                    sense_index = chunk.label().synset().name()
                except:
                    continue
                context_emb = context_builder(sentence,idx,3,embedding_dict,emb_size)
                output_dict[sense_index]+=context_emb
                count_dict[sense_index]+=1
    # Averaging
    for key in output_dict.keys():
        output_dict[key] /= count_dict[key]
    return output_dict

Now we build a sense embedding dictionary for prediction. Notice that the ouput dictionary of buildSemEmb() is a collection.defaultdict() with default value being the uniform random vector. Hence it returns a uniform random vector when some sense does not exists.

In [8]:
#Build sense dictionary for semcor corpus
semcor_senseEmb = buildSemEmb(semcor.tagged_sents(tag='sem'),50,embedding_dict)

In [9]:
semcor_senseEmb['commitment.n.03']

array([ 1.30003492, -0.09053706, -0.22600043,  1.06458565, -1.23842165,
       -2.46055579, -3.43967801, -3.36920484,  0.13196153, -2.7397281 ,
        2.89399486,  1.65822149,  2.10908488, -0.95701491, -1.44100407,
       -0.11627587,  1.65703584,  0.85361903,  2.04998585, -0.19486962,
        0.48201897, -2.1837728 , -0.83654919, -0.47885907, -1.16298954,
       -1.42235606,  0.90693904, -0.85483815,  1.06258396, -1.00788814,
        0.41451083, -1.4589622 , -0.9678323 ,  1.92015602, -1.31059001,
        2.3856872 ,  0.21004829, -0.03023176, -2.02274583, -1.1966199 ,
       -1.82946856,  1.2398847 , -1.77989179,  0.68169361, -0.47411504,
       -1.97803033, -0.10702167,  2.10165498,  3.00362617, -2.48403455])

## Expriment: bag of word comparison with sense embeddings

Using a trained embeddings and the sense embeddings that we derived by averaging the context. We can build a classifier that directly compare the bag of words (the average embeddings of the entire sentence) with sense embeddings and output the sense with highest cosine similarity.

In [11]:
example_chunk

[['The'],
 Tree(Lemma('group.n.01.group'), [Tree('NE', ['Fulton', 'County', 'Grand', 'Jury'])]),
 Tree(Lemma('state.v.01.say'), ['said']),
 Tree(Lemma('friday.n.01.Friday'), ['Friday']),
 ['an'],
 Tree(Lemma('probe.n.01.investigation'), ['investigation']),
 ['of'],
 Tree(Lemma('atlanta.n.01.Atlanta'), ['Atlanta']),
 ["'s"],
 Tree(Lemma('late.s.03.recent'), ['recent']),
 Tree(Lemma('primary.n.01.primary_election'), ['primary', 'election']),
 Tree(Lemma('produce.v.04.produce'), ['produced']),
 ['``'],
 ['no'],
 Tree(Lemma('evidence.n.01.evidence'), ['evidence']),
 ["''"],
 ['that'],
 ['any'],
 Tree(Lemma('abnormality.n.04.irregularity'), ['irregularities']),
 Tree(Lemma('happen.v.01.take_place'), ['took', 'place']),
 ['.']]

In [27]:
example_word = 'primary'
example_context = getContextEmb(center=15,emb_size=50,embedding_dict=embedding_dict,sentence=example_sentence,window_size=2)

from scipy.spatial.distance import cosine

choices = [synset.name() for synset in wn.synsets('produced')]

decision_chart = [(choice,cosine(example_context,semcor_senseEmb[choice])) for choice in choices]

decision_chart

[('produce.v.01', 0.79938981141141463),
 ('produce.v.02', 0.95091995923310568),
 ('produce.v.03', 1.1212183515751915),
 ('produce.v.04', 0.9652418620686104),
 ('grow.v.07', 0.95999618009120324),
 ('produce.v.06', 0.86363817985361668),
 ('grow.v.08', 1.2094539090980427)]

In [25]:
wn.synsets('produce')[1].definition()

'bring forth or yield'

In [26]:
wn.synsets('produce')[4].definition()

'bring out for display'