# Building Sense Embeddings

The embedding of a sense is computed by averaging the context embeddings of all contexts in which that sense occurs. There are several methods for combining word embeddings into a context embedding. Our starting point is the plain average (bag of words).

Reference: Iacobacci et al., Embeddings for Word Sense Disambiguation: An Evaluation Study
http://aclweb.org/anthology/P/P16/P16-1085.pdf
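
As a toy illustration of the averaging step described above, the sketch below computes sense vectors from made-up 3-dimensional context vectors. The sense labels and all numbers are invented for illustration and do not come from SemCor.

```python
import numpy as np

# Hypothetical context embeddings, grouped by the sense they were observed with.
contexts_by_sense = {
    "bank.n.01": [np.array([1.0, 0.0, 2.0]), np.array([3.0, 2.0, 0.0])],
    "bank.n.02": [np.array([0.0, 4.0, 4.0])],
}

# A sense embedding is the element-wise mean of its context embeddings.
sense_embeddings = {
    sense: np.mean(np.stack(vectors), axis=0)
    for sense, vectors in contexts_by_sense.items()
}

print(sense_embeddings["bank.n.01"].tolist())  # [2.0, 1.0, 1.0]
```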

In [119]:
# Import necessary libraries
import nltk
from nltk.corpus import wordnet as wn
from nltk.corpus import semcor
import numpy as np
import collections
import os
import pickle
import dill
import torch
import data
import torch.nn as nn
from torch.autograd import Variable

In [6]:
corpus = data.Corpus('./data/brown/')


In [100]:
corpus.dictionary.word2idx['<UNK>']

4

In [89]:
# # Load example embeddings
# embedding_dict = pickle.load(open('glove_50d_50kvoc.pk','rb'))
# example_sentence = semcor.sents()[0]

# Load our trained embeddings for testing
# (model.py must be in the working directory so the saved model can be loaded)
model_trained = torch.load('model_winSize4_cpu.pt')
emb_trained = model_trained.encoder
# Note: this is the encoder's weight matrix as a NumPy array, indexed by integer word id
embedding_dict = emb_trained.weight.data.numpy()
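
Because embedding_dict in the cell above is a NumPy weight matrix rather than a word-keyed dictionary, its rows have to be selected by integer word id. The sketch below shows that lookup on toy data. It assumes a word2idx mapping in the spirit of corpus.dictionary.word2idx; the four-word vocabulary and the matrix values are made up.

```python
import numpy as np

# Toy stand-ins: 4 words in the vocabulary, 3-dimensional embeddings.
word2idx = {"<UNK>": 0, "election": 1, "produced": 2, "evidence": 3}
weights = np.arange(12, dtype=float).reshape(4, 3)  # row i = vector of word id i

# Indexing the matrix with a token string does not work.
try:
    weights["election"]
    string_lookup_works = True
except IndexError:
    string_lookup_works = False
print(string_lookup_works)  # False

# Convert tokens to ids first, then select the matching rows.
ids = [word2idx[w] for w in ["election", "produced"]]
vectors = weights[ids]               # shape (2, 3): one row per token
print(vectors.sum(axis=0).tolist())  # [9.0, 11.0, 13.0]
```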

In [10]:
example_chunk = semcor.tagged_sents(tag='sem')[0]

In [11]:
example_sentence_list = semcor.tagged_sents(tag='sem')[:10]

In [101]:
# Combine word embeddings into a context embedding:
def getContextEmb(sentence,center,window_size,embedding_dict,emb_size,unk_idx=4):
    # Inputs
    # sentence: list of tokens of an untagged sentence
    # center: position of the center word
    # window_size: number of tokens taken on each side of the center word
    # embedding_dict: embeddings used to build the context, indexable by token
    # emb_size: dimensionality of the embeddings
    # unk_idx: key of the unknown-word vector (4 is the '<UNK>' index in our vocabulary)
    ################################################################
    start_pos = max(0,center-window_size)
    end_pos = min(len(sentence),center+window_size+1)
    context_tokens = sentence[start_pos:end_pos]
    output_embedding = np.zeros(emb_size)
    # Plain sum over the window (center word included); dividing by the window
    # length would give the average, and cosine similarity ignores that scaling
    for word in context_tokens:
        try:
            output_embedding+=embedding_dict[word]
        except Exception:
            # Any failed lookup falls back to the unknown-word vector
            output_embedding+=embedding_dict[unk_idx]
    return output_embedding
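
The window arithmetic in getContextEmb clips at the sentence boundaries and includes the center word itself. The standalone sketch below isolates just that slicing on a toy sentence; context_window is a hypothetical helper name, not part of the notebook.

```python
def context_window(tokens, center, window_size):
    """Return the tokens within `window_size` positions of `center`, inclusive."""
    start = max(0, center - window_size)
    end = min(len(tokens), center + window_size + 1)
    return tokens[start:end]

tokens = ["the", "election", "produced", "no", "evidence"]

print(context_window(tokens, 2, 1))  # ['election', 'produced', 'no']
print(context_window(tokens, 0, 2))  # ['the', 'election', 'produced']
print(context_window(tokens, 4, 2))  # ['produced', 'no', 'evidence']
```

Near a sentence edge the window simply holds fewer tokens, so no padding is needed.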

Next, we define a function that builds a dictionary of sense embeddings.

In [6]:
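def buildSemEmb(tagged_sents,emb_size,embedding_dict,context_builder = getContextEmb,window_size=4):
    output_dict = collections.defaultdict(lambda: np.zeros(emb_size))
    count_dict = collections.defaultdict(int)
    for sentence in tagged_sents:
        # Flatten the chunks into plain tokens and remember where each chunk starts,
        # because context_builder expects a list of tokens and a token position
        tokens,starts = [],[]
        for chunk in sentence:
            starts.append(len(tokens))
            tokens.extend(chunk.leaves() if isinstance(chunk,nltk.Tree) else chunk)
        for idx,chunk in enumerate(sentence):
            # Untagged chunks are plain lists; nltk.Tree subclasses list, so test the exact type
            if type(chunk) is list:
                continue
            # Some labels are broken (no synset can be resolved), so skip those chunks
            try:
                sense_index = chunk.label().synset().name()
            except Exception:
                continue
            context_emb = context_builder(tokens,starts[idx],window_size,embedding_dict,emb_size)
            output_dict[sense_index]+=context_emb
            count_dict[sense_index]+=1
    # Average the accumulated context embeddings of each sense
    for key in output_dict.keys():
        output_dict[key] /= count_dict[key]
    return output_dict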

Now we build a sense embedding dictionary for prediction. Note that the output of buildSemEmb() is a collections.defaultdict whose default value, as the function is written above, is a zero vector (np.zeros(emb_size)). Looking up a sense that never occurred in the corpus therefore returns a zero vector instead of raising a KeyError.
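
One consequence worth keeping in mind is that indexing a defaultdict with an unseen key calls the factory and stores the result. The sketch below shows this with a zero-vector factory like the one in buildSemEmb; the embedding size and the sense names are toy values chosen for illustration.

```python
import collections
import numpy as np

emb_size = 3
sense_emb = collections.defaultdict(lambda: np.zeros(emb_size))
sense_emb["produce.v.01"] += np.ones(emb_size)

print(len(sense_emb))                  # 1
missing = sense_emb["produce.v.02"]    # lookup of an unseen sense ...
print(missing.tolist())                # [0.0, 0.0, 0.0]
print(len(sense_emb))                  # 2  (... silently adds it)

# .get() returns None for unseen senses and leaves the dictionary unchanged.
print(sense_emb.get("grow.v.07"))      # None
print(len(sense_emb))                  # 2
```

Using .get() or an explicit membership test is one way to tell senses observed in the corpus apart from default entries.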

In [7]:
#Build sense dictionary for semcor corpus
semcor_senseEmb = buildSemEmb(semcor.tagged_sents(tag='sem'),512,embedding_dict)



Using trained word embeddings and the sense embeddings derived by averaging contexts, we can build a classifier that directly compares the bag of words (the average embedding of the entire sentence) with the sense embeddings of the candidate senses and outputs the sense with the highest cosine similarity.
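
A minimal sketch of such a classifier is given below, assuming the context vector and the candidate sense vectors are already available as NumPy arrays. The helper names cosine_similarity and predict_sense and all vector values are hypothetical.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either vector is all zeros)."""
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / norm) if norm > 0 else 0.0

def predict_sense(context_vec, sense_vectors):
    """Return the candidate sense whose vector is most similar to the context."""
    return max(sense_vectors,
               key=lambda sense: cosine_similarity(context_vec, sense_vectors[sense]))

# Made-up vectors for two candidate senses.
sense_vectors = {
    "produce.v.01": np.array([1.0, 0.0, 1.0]),
    "produce.v.04": np.array([0.0, 1.0, 0.0]),
}
context_vec = np.array([2.0, 0.5, 1.5])

print(predict_sense(context_vec, sense_vectors))  # produce.v.01
```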

## Experiment: bag-of-words comparison with sense embeddings

First we will make a 

In [109]:
example_word = 'produced'
example_sentence = semcor.sents()[0]
# Context embedding for the token at position 15 ('produced')
example_context = getContextEmb(center=15,emb_size=512,embedding_dict=embedding_dict,
                                sentence=example_sentence,window_size=4)
#print(example_context)
from scipy.spatial.distance import cosine
# Candidate senses of the example word
choices = [synset.name() for synset in wn.synsets(example_word)]
# scipy's cosine() is a distance (1 - cosine similarity): smaller means more similar
decision_chart = [(choice,cosine(example_context,semcor_senseEmb[choice])) for choice in choices]
decision_chart



[('produce.v.01', 0.88077156310955762),
 ('produce.v.02', 1.0337513225187784),
 ('produce.v.03', 0.96219179819244816),
 ('produce.v.04', 1.0077375234662809),
 ('grow.v.07', 0.95237663654891869),
 ('produce.v.06', 0.97578513729588945),
 ('grow.v.08', 1.0259909720297444)]
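
Since scipy.spatial.distance.cosine returns a distance (1 minus the cosine similarity), the predicted sense is the entry with the smallest value. The sketch below reads the prediction off the chart, with the distances copied from the output above and rounded to four decimals.

```python
# Distances copied from the decision chart above (rounded to four decimals).
decision_chart = [
    ("produce.v.01", 0.8808),
    ("produce.v.02", 1.0338),
    ("produce.v.03", 0.9622),
    ("produce.v.04", 1.0077),
    ("grow.v.07", 0.9524),
    ("produce.v.06", 0.9758),
    ("grow.v.08", 1.0260),
]

# Smallest cosine distance = highest cosine similarity.
best_sense, best_distance = min(decision_chart, key=lambda pair: pair[1])
print(best_sense)                     # produce.v.01
print(round(1.0 - best_distance, 4))  # 0.1192
```

All distances lie close to 1, so even the best candidate is nearly orthogonal to the context vector, and values above 1 correspond to a negative similarity.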

In [112]:
wn.synsets('produce')[1].definition()

'bring forth or yield'

In [61]:
wn.synsets('produce')[4].definition()

'bring out for display'

## Experiment: using the hold-out prediction vector

In [117]:
context_eg = example_sentence[15-4:15+4+1]
holdout = context_eg[4]
print('Context: %s'%(' '.join(context_eg)))
print('Holdout word: %s'%(holdout))

Context: 's recent primary election produced `` no evidence ''
Holdout word: produced


In [120]:
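# Map tokens to vocabulary ids; tokens missing from the vocabulary (e.g. "'s") fall back to '<UNK>'
unk_idx = corpus.dictionary.word2idx['<UNK>']
context_idx = Variable(torch.LongTensor([corpus.dictionary.word2idx.get(word,unk_idx) for word in context_eg]).view(-1,1))
hidden = model_trained.init_hidden(1)
output,hidden = model_trained(context_idx,hidden)
output_np = output.data.numpy()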
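
Lookups into a fixed vocabulary need a policy for unseen tokens such as "'s". The sketch below maps such tokens to the unknown-word id and also reports which tokens were replaced. The helper name tokens_to_ids and the toy vocabulary are assumptions made for illustration; in the notebook the mapping would be corpus.dictionary.word2idx.

```python
def tokens_to_ids(tokens, word2idx, unk_token="<UNK>"):
    """Map tokens to ids, replacing out-of-vocabulary tokens with the <UNK> id."""
    unk_id = word2idx[unk_token]
    ids = [word2idx.get(token, unk_id) for token in tokens]
    unseen = [token for token in tokens if token not in word2idx]
    return ids, unseen

# Toy vocabulary with made-up ids.
word2idx = {"<UNK>": 4, "recent": 10, "primary": 11, "election": 12, "produced": 13}
context = ["'s", "recent", "primary", "election", "produced"]

ids, unseen = tokens_to_ids(context, word2idx)
print(ids)     # [4, 10, 11, 12, 13]
print(unseen)  # ["'s"]
```

Reporting the replaced tokens makes it easier to judge how much of a context the model actually sees.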