# Word Sense Disambiguation Knowledge Sources
To train an all-words WSD model, several knowledge sources are necessary, including:
- Sense inventory
- Sense labeled corpus
- Embeddings

Here we explore how to leverage these knowledge sources using Python.

## Sense inventory: WordNet

The NLTK package includes a handy WordNet interface. The senses of a word are stored as synsets: each synset holds the definition of one sense and the words (lemmas) related to that sense.

For more detailed documentation of the WordNet interface, please refer to: http://www.nltk.org/howto/wordnet.html

For further reading and documentation on the NLTK package, see:
http://www.nltk.org/book/

In [1]:
import nltk

In [2]:
from nltk.corpus import wordnet as wn

To get access to all the synsets of a particular word, use wn.synsets(word).

In [3]:
# Example. 
wn.synsets('qualify')

[Synset('qualify.v.01'),
 Synset('qualify.v.02'),
 Synset('qualify.v.03'),
 Synset('qualify.v.04'),
 Synset('stipulate.v.01'),
 Synset('qualify.v.06'),
 Synset('modify.v.02')]
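
The synset names in the output above follow the pattern lemma.pos.index, where pos is a one-letter part-of-speech code (v for verb here). As a minimal sketch, and not part of the NLTK API, such a name can be split into its parts with plain string handling. The helper name parse_synset_name is an assumption made for this illustration.

```python
def parse_synset_name(name):
    """Split a WordNet synset name such as 'qualify.v.03' into its parts."""
    # Split from the right so that a dot inside the lemma itself is preserved.
    lemma, pos, index = name.rsplit('.', 2)
    return lemma, pos, int(index)


print(parse_synset_name('qualify.v.03'))
print(parse_synset_name('stipulate.v.01'))
```

Note that the name of a synset is taken from one of its lemmas, which is why wn.synsets('qualify') can return synsets named after 'stipulate' or 'modify'.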

Every synset contains the lemmas that belong to it. Use .lemmas() to access them.

In [4]:
# Example: choose the third synset of the word 'qualify', which is not its usual sense.
qualify_03 = wn.synsets('qualify')[2]
# It is also easy to print out its definition
print('qualify_.v.03 definition: %s'%(qualify_03.definition()))
qualify_03.lemmas()

qualify_.v.03 definition: make more specific


[Lemma('qualify.v.03.qualify'), Lemma('qualify.v.03.restrict')]

## Sense labeled corpus
NLTK also includes SemCor, a corpus labeled with WordNet senses. For detailed documentation of NLTK tagged corpora, please refer to:
http://www.nltk.org/howto/corpus.html#tagged-corpora

In [6]:
from nltk.corpus import semcor

SemCor is a chunked corpus: the words of each sentence are grouped into chunks, and the sense tags are attached to chunks rather than to individual words. Noun phrases are usually chunked together, which avoids the poor sense definitions that strictly word-level tagging would give. To access all chunks, use .tagged_chunks(). The tag argument selects the annotation attached to each chunk: tag='pos' for the POS tag, tag='sem' for the sense tag, or tag='both' for both.

In [83]:
chunks = semcor.tagged_chunks(tag='both')

In [75]:
example_sentence = semcor.tagged_sents(tag='sem')[0]
print(example_sentence)

[['The'], Tree(Lemma('group.n.01.group'), [Tree('NE', ['Fulton', 'County', 'Grand', 'Jury'])]), Tree(Lemma('state.v.01.say'), ['said']), Tree(Lemma('friday.n.01.Friday'), ['Friday']), ['an'], Tree(Lemma('probe.n.01.investigation'), ['investigation']), ['of'], Tree(Lemma('atlanta.n.01.Atlanta'), ['Atlanta']), ["'s"], Tree(Lemma('late.s.03.recent'), ['recent']), Tree(Lemma('primary.n.01.primary_election'), ['primary', 'election']), Tree(Lemma('produce.v.04.produce'), ['produced']), ['``'], ['no'], Tree(Lemma('evidence.n.01.evidence'), ['evidence']), ["''"], ['that'], ['any'], Tree(Lemma('abnormality.n.04.irregularity'), ['irregularities']), Tree(Lemma('happen.v.01.take_place'), ['took', 'place']), ['.']]


### Useful methods of chunk objects

In [87]:
example_chunk = chunks[1]

In [91]:
# Accessing the words in a chunk
example_chunk.leaves()

['Fulton', 'County', 'Grand', 'Jury']

In [92]:
# Accessing the lemma of a chunk
example_chunk.label()

Lemma('group.n.01.group')

In [95]:
# Accessing the synset of a chunk
example_chunk.label().synset()

Synset('group.n.01')
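
The methods above can be combined to turn each chunk into a (tokens, sense) pair for training. The following is a minimal sketch under stated assumptions: the chunks come from tag='sem', untagged chunks are plain token lists such as ['The'], and a label without a synset() method is kept as its string form. The helper name chunk_to_pair is hypothetical and not part of NLTK.

```python
def chunk_to_pair(chunk):
    """Return (tokens, sense_name) for one chunk; sense_name is None if untagged."""
    # Untagged chunks are plain lists of tokens and have no label() method.
    if not hasattr(chunk, 'label'):
        return list(chunk), None
    label = chunk.label()
    # A Lemma label gives access to its synset; otherwise keep the label as text.
    if hasattr(label, 'synset'):
        sense_name = label.synset().name()
    else:
        sense_name = str(label)
    return chunk.leaves(), sense_name


# An untagged chunk needs no corpus data to demonstrate.
print(chunk_to_pair(['The']))

# With the corpus loaded, usage would look like this (left commented out):
# pairs = [chunk_to_pair(c) for c in semcor.tagged_chunks(tag='sem')]
```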

If you are interested in the plain sentences rather than their labels and syntactic trees, use the .sents() method to access the tokenized sentences.

In [7]:
sentence = semcor.sents()[0]

In [13]:
example_synset = wn.synsets('group')[0]
example_synset.name()

'group.n.01'

## Word Embeddings

Two useful pretrained embeddings are Word2Vec and GloVe. The following function extracts word embeddings from the GloVe dataset.
load_glove_embeddings(glove_directory, emsize, voc_size) takes:

- glove_directory: the directory where your GloVe embeddings are saved (for example glove.6B); its name is also used as the prefix of the embedding file name
- emsize: the embedding size of your GloVe embeddings, which must be one of 50, 100 and 300
- voc_size: the number of vocabulary entries you want to load

In [9]:
import os
import numpy as np

def load_glove_embeddings(glove_directory, emsize=50, voc_size=50000):
    # The directory name (glove.6B or another training corpus size)
    # is also the prefix of the embedding file names
    dirname = os.path.basename(os.path.normpath(glove_directory))
    if emsize not in (50, 100, 300):
        print('Please select from 50, 100 or 300')
        return None
    path = os.path.join(glove_directory, '%s.%sd.txt' % (dirname, emsize))
    loaded_embeddings = {}
    # Open with a context manager so the file is closed after reading
    with open(path, encoding='utf-8') as f:
        for i, line in enumerate(f):
            # Keep only the first voc_size entries of the file
            if i >= voc_size:
                break
            # Each line holds a word followed by its vector components
            s = line.split()
            loaded_embeddings[s[0]] = np.asarray(s[1:], dtype='float64')
    return loaded_embeddings

In [10]:
# Example: GloVe Extraction
loaded_embeddings = load_glove_embeddings('../datasets/glove.6B/')

In [12]:
print(loaded_embeddings['victory'])

[-0.66385   0.41015   0.073617  0.85937   0.30031  -0.11978  -0.45367
  1.4574   -0.73222   0.28086  -0.7589   -1.2996   -0.96887  -0.57294
 -0.25255  -0.7098    0.52366  -1.3184   -1.7125   -0.074232 -1.2343
 -0.37677  -0.4526   -0.95694   0.36827  -1.8201   -0.20622  -0.31884
  0.1527   -0.30461   2.3935    1.3234   -1.0144    0.35188  -0.17079
 -0.67128   0.38904   0.94105  -0.42382  -1.3848    0.15837  -0.59283
 -0.80945  -0.46636  -0.086871  1.419    -0.5528   -0.19525   0.43202
 -0.6991  ]
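
Because only the first voc_size words are loaded, looking up a word outside that vocabulary raises a KeyError. A minimal sketch of a safer lookup follows, assuming that a zero vector is an acceptable stand-in for unknown words and that the vocabulary is lowercased (as in the uncased glove.6B files). The helper name lookup_embedding and the toy dictionary are illustrative only.

```python
import numpy as np


def lookup_embedding(embeddings, word, emsize=50):
    """Return the vector for word, or a zero vector if the word is unknown."""
    # Lowercase the query, assuming the loaded vocabulary is uncased.
    vector = embeddings.get(word.lower())
    if vector is None:
        # Fallback chosen for this sketch: an all-zero vector of the same size.
        return np.zeros(emsize)
    return vector


# Toy embeddings standing in for the dictionary loaded from GloVe.
toy_embeddings = {'victory': np.ones(3)}
print(lookup_embedding(toy_embeddings, 'Victory', emsize=3))
print(lookup_embedding(toy_embeddings, 'unseen', emsize=3))
```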


In [17]:
# Save embeddings for future use
import pickle
with open('glove_50d_50kvoc.pk', 'wb') as fp:
    pickle.dump(loaded_embeddings, fp)
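
To reuse saved embeddings later, the pickle file can be read back with pickle.load. The sketch below shows the full round trip on a toy dictionary written to a temporary directory, so it runs without the GloVe files; the file name toy_embeddings.pk is an assumption made for the example.

```python
import os
import pickle
import tempfile

import numpy as np

# Toy embeddings standing in for the dictionary loaded from GloVe.
toy_embeddings = {'victory': np.array([0.1, 0.2])}

# Write the dictionary to a temporary location.
path = os.path.join(tempfile.mkdtemp(), 'toy_embeddings.pk')
with open(path, 'wb') as fp:
    pickle.dump(toy_embeddings, fp)

# Read it back; the file must be opened in binary mode ('rb').
with open(path, 'rb') as fp:
    restored_embeddings = pickle.load(fp)

print(restored_embeddings['victory'])
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.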