# Word Sense Disambiguation using a Sensegram

## Intuition

### Sense Determination

We determine the sense of a target word in a particular context by choosing the sense vector of the target word that has the highest cosine similarity with the aggregate context vector (the average of the context word vectors after removing the stop words).
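
As a minimal sketch of this idea, the snippet below picks the closest sense for a toy example. The function name `pick_sense` and the 3-dimensional vectors are hypothetical and chosen only for illustration; the notebook itself works with 300-dimensional vectors.

```python
import numpy as np

def pick_sense(context_vecs, sense_vecs):
    """Return the index of the sense vector with the highest cosine
    similarity to the mean context vector, plus all similarities."""
    context_mean = np.mean(context_vecs, axis=0)
    sims = [
        float(np.dot(s, context_mean)
              / (np.linalg.norm(s) * np.linalg.norm(context_mean)))
        for s in sense_vecs
    ]
    return int(np.argmax(sims)), sims

# Toy vectors (made-up values, not taken from the sensegram or GloVe).
context_vecs = np.array([[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]])
sense_vecs = np.array([[0.0, 1.0, 0.0], [1.0, 0.1, 0.0]])

best_idx, sims = pick_sense(context_vecs, sense_vecs)
print(best_idx, sims)
```

Here the second sense vector points in almost the same direction as the mean context vector, so its index is the one selected.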

### Evaluation

For evaluation, we use the answer key to **group** the test sentences that share the same sense of the same target word. We then run the function on every sentence of a *group* and check whether most of them (ideally all of them) receive the same sense index.

We assume that the most common sense index returned for a *group* is the correct sense index for that *group* of sentences. Under this assumption the score measures how consistently the sentences of a *group* are mapped to a single index. Accuracy is then calculated using the formula:

```
accuracy = ∑(g) #(most_common_index(g)) / total_sentences
```

where `#(most_common_index(g))` is the number of occurrences of the most common index obtained by running the function on a *group* `g`, and `total_sentences` is the total number of sentences in the test dataset for which the function returns a valid output.
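
The formula can be sketched on made-up predictions as follows. The function name `group_accuracy` and the group contents are hypothetical; treating `None` and `-1` as invalid outputs is an assumption that mirrors how the evaluation cell later in this notebook filters them.

```python
from collections import Counter

def group_accuracy(predictions_by_group):
    """Sum the counts of the most common index per group and divide
    by the number of valid predictions over all groups."""
    correct = 0
    total = 0
    for indices in predictions_by_group.values():
        # Keep only valid sense indices.
        valid = [i for i in indices if i is not None and i != -1]
        if not valid:
            continue
        correct += Counter(valid).most_common(1)[0][1]
        total += len(valid)
    return correct / total

# Hypothetical sense indices predicted for two groups.
groups = {
    "add.v%2": [0, 0, 0, 1],     # most common index 0 occurs 3 times
    "add.v%1": [2, 2, None, 1],  # most common index 2 occurs 2 times, None is dropped
}
acc = group_accuracy(groups)
print(acc)  # (3 + 2) / (4 + 3)
```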

## Imports and Initializations

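We import `numpy` for working with arrays, `nltk` for POS tagging and lemmatization, `stop_words` for the English stop word list, and the standard-library modules `os`, `pickle`, `pprint`, `re` and `time` for other utility functions.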

In [1]:
import os, pprint, pickle, re
import numpy as np
from stop_words import get_stop_words
import nltk
import time

lem = nltk.stem.wordnet.WordNetLemmatizer()
pp = pprint.PrettyPrinter(indent=2)

TEST_SENTENCES_PATH = '/Users/sounak/Documents/clg/nlp/nlp-projects/data/wsd/sentences.txt'

## Helper functions

The two helper functions `save_obj` and `load_obj` pickle an object to disk and load it back from the pickle file. They are used to cache the vector dicts so that later runs load them faster. `ListTable` is a small `list` subclass that renders a 2-dimensional list as an HTML table in the notebook.

In [2]:
class ListTable(list):
    """ Overridden list class which takes a 2-dimensional list of 
        the form [[1,2,3],[4,5,6]], and renders an HTML Table in 
        IPython Notebook. """
    
    def _repr_html_(self):
        html = ["<table>"]
        for row in self:
            html.append("<tr>")
            
            for col in row:
                html.append("<td>{0}</td>".format(col))
            
            html.append("</tr>")
        html.append("</table>")
        return ''.join(html)

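def save_obj(obj, name):
    # Create the cache directory on first use.
    os.makedirs('obj', exist_ok=True)
    with open('obj/'+ name + '.pkl', 'wb') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)

def load_obj(name):
    # Return None when the cache file is missing or cannot be unpickled.
    try:
        with open('obj/' + name + '.pkl', 'rb') as f:
            return pickle.load(f)
    except (OSError, EOFError, pickle.UnpicklingError):
        return None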

## Loading the Sensegram

In [3]:
start = time.time()

sense_vecs = load_obj('sense_vecs')
pos_tags = load_obj('pos_tags')

if not (sense_vecs and pos_tags):
    SENSEGRAM_PATH = "/Users/sounak/Documents/clg/nlp/nlp-projects/data/sensegrams_of_wikipedia_cluster"
    sense_vecs = {}
    pos_tags = set()

    # Each line holds "word#POS", the sense index and the sense vector, separated by tabs.
    with open(SENSEGRAM_PATH, 'r') as f:
        for line in f:
            t = line.split('\t')
            word, pos = t[0].split('#')
            pos_tags.add(pos)
            # Sense index 0 starts a new list of senses for this (word, POS) pair.
            if t[1] == '0':
                sense_vecs[(word, pos)] = []
            # eval runs the text as Python code, so only use it on a trusted file.
            sense_vecs[(word, pos)].append(np.array(eval(t[2])))
    save_obj(sense_vecs, 'sense_vecs')
    save_obj(pos_tags, 'pos_tags')

end = time.time()
    
print('sense_vecs have been loaded in: ', (end - start), 'secs')

sense_vecs have been loaded in:  1.356163740158081 secs


## Loading the GloVe Model

In [4]:
start = time.time()

word_vecs = load_obj('word_vecs')

if not word_vecs:
    GLOVE_PATH = "/Users/sounak/Documents/clg/nlp/nlp-projects/data/glove.6B.300d.txt"
    word_vecs = {}
    # Each line holds a word followed by its vector components, separated by spaces.
    with open(GLOVE_PATH, 'r') as f:
        for line in f:
            t = line.split(' ')
            word_vecs[t[0]] = np.array([float(_) for _ in t[1:]])
    save_obj(word_vecs, 'word_vecs')
    
end = time.time()    
    
print('word_vecs have been loaded in: ', (end - start), 'secs')

word_vecs have been loaded in:  3.3434998989105225 secs


## Computing Sense

The function `compute_sense_idx` takes a tokenized sentence and the target word, and returns the index of the sense of the target that was used in the current context together with the POS tag under which that sense was found. It returns `(None, None)` when the target does not occur in the sentence, and the index stays `-1` when no sense vector is available for the target.

The function maximizes the cosine similarity between an aggregate context vector and the vectors of the different senses of the target word. Stop words are excluded from the context, and the remaining context words are lemmatized before their vectors are aggregated.

In [5]:
stop_words = get_stop_words('en')

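def compute_sense_idx(sentence, target):
    # `sentence` is a list of tokens; give up early if the target is absent.
    if target not in sentence:
        return None, None
    sentence = nltk.pos_tag(sentence)
    # The context is every tagged token except the target itself.
    context = list(filter(lambda x: x[0] != target, sentence))
    _sum = np.zeros(300)
    # Lemmatize when the first letter of the tag is a WordNet POS code.
    # Penn Treebank adjective tags start with 'J', so adjectives are kept as they are.
    preprocess = lambda w, pos : (lem.lemmatize(w, pos[0].lower()), pos) if pos[0].lower() in ['a', 'r', 'n', 'v'] else (w, pos)
    context_final = [preprocess(w, pos) for w, pos in context if w not in stop_words]
    for w, _ in context_final:
        try:
            _sum += word_vecs[w]
        except KeyError:
            # Words without a GloVe vector are skipped.
            continue
        
    # Cosine similarity ignores scale, so dividing by len(context) instead of
    # the number of summed vectors does not change which sense wins.
    cw_mean = np.divide(_sum, len(context))
    max_idx = -1
    max_value = float('-inf')
    max_pos = None
    # Search the senses of the target under every POS tag seen in the sensegram.
    for pos in pos_tags:
        try:
            for idx, sense in enumerate(sense_vecs[(target, pos)]):
                if np.linalg.norm(sense) > 0:
                    result = np.divide(np.dot(sense, cw_mean), (np.linalg.norm(sense) * np.linalg.norm(cw_mean)))
                    if result > max_value:
                        max_value = result
                        max_idx = idx
                        max_pos = pos
        except KeyError:
            # No sense vectors for this (target, POS) pair.
            continue
    # max_idx is still -1 if no sense could be scored.
    return max_idx, max_pos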

## Tokenizer

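This is a lightweight tokenizer for the input sentences. It lowercases the text, splits it on every run of characters that are not letters, digits or apostrophes, removes the possessive `'s`, and drops empty tokens.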

In [6]:
def tokenize(text):
    words = [_.lower() for _ in re.split(r"[^a-zA-ZÀ-ÿ0-9']+", text)]
    words = [_[:-2] if _.endswith("'s") else _ for _ in words]
    return list(filter(('').__ne__, words))
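
To illustrate what the tokenizer produces, the sketch below restates the same steps in a standalone function and applies it to a made-up sentence. The name `tokenize_sketch` is hypothetical, and checking for the possessive with `endswith` is an assumption about the intended behavior.

```python
import re

def tokenize_sketch(text):
    # Split on runs of characters that are not letters, digits or apostrophes.
    words = [w.lower() for w in re.split(r"[^a-zA-ZÀ-ÿ0-9']+", text)]
    # Strip a trailing possessive 's; other apostrophes (e.g. "isn't") are kept.
    words = [w[:-2] if w.endswith("'s") else w for w in words]
    # Drop the empty strings that re.split leaves at the edges.
    return [w for w in words if w]

tokens = tokenize_sketch("The dog's bone, isn't it?")
print(tokens)
```

The possessive is removed from `dog's`, while the contraction `isn't` is left intact because it does not end in `'s`.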

## Testing

We test `compute_sense_idx` on the SemEval-2013 Task 13 test dataset.

In [7]:
start = time.time()

from xml.dom.minidom import parse
FILE = './data/wsd-test/contexts/senseval2-format/semeval-2013-task-13-test-data.senseval2.xml'

dom = parse(FILE)
inst = dom.getElementsByTagName('instance')

sents_raw = {}
sents = {}

for i in inst:
    k = i.attributes['id'].value
    context = i.getElementsByTagName('context')[0]
    word = context.getElementsByTagName('head')[0].childNodes[0].nodeValue
    v = ' {} '.format(word).join(t.nodeValue.strip() for t in context.childNodes if t.nodeType == t.TEXT_NODE)
    sents_raw[k] = v
    sents[k + '.' + word] = tokenize(v)

end = time.time()    
    
print('test sentences have been loaded in: ', (end - start), 'secs')

test sentences have been loaded in:  0.46253013610839844 secs
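
As a small illustration of how a sentence is rebuilt from an instance, the sketch below parses a made-up XML fragment with the same `minidom` calls. The fragment, the `lexelt` wrapper and the instance id are assumptions for illustration and are not copied from the dataset.

```python
from xml.dom.minidom import parseString

# Hypothetical instance: the target word is wrapped in <head>.
xml_text = (
    '<corpus><lexelt item="add.v">'
    '<instance id="add.v.1"><context>To prepare the solution, '
    '<head>add</head> 10 mL of acid.</context></instance>'
    '</lexelt></corpus>'
)

dom = parseString(xml_text)
instance = dom.getElementsByTagName('instance')[0]
instance_id = instance.attributes['id'].value
context = instance.getElementsByTagName('context')[0]
head = context.getElementsByTagName('head')[0].childNodes[0].nodeValue

# The text nodes are the parts before and after <head>; joining them
# with the head word restores the full sentence.
text_parts = [t.nodeValue.strip() for t in context.childNodes
              if t.nodeType == t.TEXT_NODE]
sentence = ' {} '.format(head).join(text_parts)
print(instance_id, '|', sentence)
```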


In [8]:
res = {}
pos = {}
times = []

# Repeat the full pass 10 times to get a more stable timing estimate.
for i in range(10):
    start = time.time()
    
    for k, v in sents.items():
        # Keys of `sents` have the form "<instance id>.<target word>".
        inst_id = '.'.join(k.split('.')[:-1])
        idx, p = compute_sense_idx(v, k.split('.')[-1])
        res[inst_id] = idx
        pos[inst_id] = p

    end = time.time()

    times.append(end - start)
    print('(iter {})'.format(i + 1), 'results have been loaded in: ', times[i], 'secs')

(iter 1) results have been loaded in:  13.639203071594238 secs
(iter 2) results have been loaded in:  10.572648048400879 secs
(iter 3) results have been loaded in:  11.374758958816528 secs
(iter 4) results have been loaded in:  10.52796196937561 secs
(iter 5) results have been loaded in:  10.536474704742432 secs
(iter 6) results have been loaded in:  10.512830018997192 secs
(iter 7) results have been loaded in:  10.635956764221191 secs
(iter 8) results have been loaded in:  10.488242149353027 secs
(iter 9) results have been loaded in:  10.605721950531006 secs
(iter 10) results have been loaded in:  10.517795085906982 secs


In [9]:
aver = sum(times) / len(times)
aver_single = aver / len(res)

print('average time for testing a single sentence is: ', aver_single)

average time for testing a single sentence is:  0.0023418577209319154


## Evaluation



In [10]:
start = time.time()

KEY = './data/wsd-test/keys/gold/all.singlesense.key'

keys = {}
keys_rev = {}
with open(KEY, 'r') as f:
    for line in f:
        # Fields used: l[0] target (e.g. add.v), l[1] instance id, l[2] sense key.
        l = line.strip().split(' ')
        # The sense id is the part of the sense key between '%' and the first ':'.
        sense = l[2].split(':')[0].split('%')[1]
        keys[l[1]] = sense
        # Group the instance ids by "<target>%<sense id>".
        keys_rev.setdefault(l[0] + '%' + sense, []).append(l[1])
    
end = time.time()
    
print('keys have been loaded in: ', (end - start), 'secs')

keys have been loaded in:  0.020737886428833008 secs
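
The grouping performed on the key file can be sketched on a few made-up lines. The line contents and the helper name `parse_key_line` are hypothetical; only the field positions and the splitting on `:` and `%` follow the cell above.

```python
def parse_key_line(line):
    """Split one key line into instance id, sense id and group key."""
    target, instance_id, sense_key = line.strip().split(' ')[:3]
    # The sense id sits between '%' and the first ':' of the sense key.
    sense_id = sense_key.split(':')[0].split('%')[1]
    return instance_id, sense_id, target + '%' + sense_id

# Made-up key lines in the assumed "<target> <instance id> <sense key>" layout.
lines = [
    "add.v add.v.1 add%2:30:00::",
    "add.v add.v.2 add%2:30:00::",
    "add.v add.v.3 add%1:30:00::",
]

keys_rev = {}
for raw in lines:
    instance_id, sense_id, group = parse_key_line(raw)
    keys_rev.setdefault(group, []).append(instance_id)

print(keys_rev)
```

Each key of `keys_rev` is one *group* in the sense of the evaluation described at the top of this notebook.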


In [11]:
from collections import Counter

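correct = _sum = 0
counts_raw = {}
for k in keys_rev.keys():
    # Count how often each sense index was returned within the group.
    counts_raw[k] = Counter([res[_] for _ in keys_rev[k]])
    # Drop invalid outputs: -1 (no sense could be scored) and None (target not in sentence).
    del counts_raw[k][-1]
    del counts_raw[k][None]
    _sum += sum(counts_raw[k].values())
    # The most common index is assumed to be the correct one for the group.
    correct += counts_raw[k].most_common(1)[0][1]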
    
counts = {}
for k, v in counts_raw.items():
    try:
        counts[k.split('%')[0]][int(k.split('%')[1])] = v
    except KeyError:
        counts[k.split('%')[0]] = {int(k.split('%')[1]): v}
    
print('Accuracy: ', (correct / _sum))
print('Total number of sentences: ', _sum)

Accuracy:  0.8084126496776174
Total number of sentences:  3257


In [12]:
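TP = 0
FN = 0
FP = 0  # never incremented below, so precision is 1.0 by construction

for k1, v1 in counts.items():
    for k2, v2 in v1.items():
        # Sentences with the group's most common index count as true positives,
        # all other valid sentences of the group as false negatives.
        TP += v2.most_common(1)[0][1]
        FN += sum(v2.values()) - v2.most_common(1)[0][1]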
        
p = TP / (TP + FP)
r = TP / (TP + FN)
f1 = 2 * ((p * r) / (p + r))

print('Precision: ', p)
print('Recall: ', r)
print('F1-score: ', f1)

Precision:  1.0
Recall:  0.8084126496776174
F1-score:  0.8940577249575552
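
Since `FP` stays at 0 in the cell above, precision is always 1.0 and the F1-score depends on recall alone: it reduces to `2 * r / (1 + r)`. The sketch below checks this on made-up counts; the function name `prf_from_counts` and the numbers are hypothetical.

```python
def prf_from_counts(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up counts with no false positives, as in the cell above.
p_demo, r_demo, f1_demo = prf_from_counts(tp=80, fp=0, fn=20)

# With precision fixed at 1.0, F1 is a function of recall only.
f1_from_recall = 2 * r_demo / (1 + r_demo)
print(p_demo, r_demo, f1_demo, f1_from_recall)
```

The recall reported here is therefore the same quantity as the accuracy computed in the previous cell.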


## Test Report

In [13]:
from astropy.table import Table
p = Table(names=['Key', 'Sentence', 'Index', 'Tag'], dtype=('S10', 'S1000', 'i', 'S10'))

for k, v in keys_rev.items():
    for i in v:
        if res[i] not in [None, -1]:
            p.add_row((k, sents_raw[i], res[i], pos[i]))

p.show_in_notebook(show_row_index=False)

Key,Sentence,Index,Tag
add.v%2,if you add the um uh people of various sexual persuasions and those who never intend to marry and those who are retired and those who are um just looking for fun they people with families turn out to be such a small minority that they can't get the tax bill passed no matter what happens,0,VB
add.v%2,"An added benefit of a warm winter, of course, would be that it would make it clearer than ever that Gore's concern about global warming is well-founded.",0,JJ
add.v%2,"To prepare a 10% solution of acid, add 10 mL of concentrated acid to 90 mL of deionized water.",0,VB
add.v%2,"To find the day of the week on which that date fell, first divide 1947 by 4: the answer is 486 (with a remainder of 3, which should be ignored); add this number: 1947 + 486 = 2433 .",0,VB
add.v%2,uh i added up all the taxes that we were going to pay on all these different specific luxury items and travel expenses and everything else,0,JJ
add.v%2,"The tripe with onions and garlic is cooked for several hours, posole or hominy is added , along with red chile.",0,JJ
add.v%2,If the response-to-advertising mail were also added to advertising mail it would have resulted in double counting.,0,JJ
add.v%2,"But both also miss his perplexing reality, the ambiguity of a man whose life and work refuse to add up to a simple object lesson.",0,VB
add.v%2,To these natural ingredients have been added just about every amenity and service needed for the perfect holiday.,0,JJ
add.v%2,"Simply go to www.pointcast.com, download the PointCast installation software, install the software, configure the software (using the easy-to-misunderstand instructions), then go looking for Slate, which can be found by simply right-clicking on the box labeled ""connections,"" choosing the ""personalize"" option, clicking the "" add "" button, then clicking on ""news and weather,"" then scrolling halfway to Ohio until you come to a listing for Slate, and then clicking on ""subscribe.""",1,VB
