#  Lexical Semantics

In this notebook we will use NLTK to access WordNet, look at some senses and lexical relations, and find paths between words. First, let's load NLTK and make sure WordNet is accessible.

In [None]:
from nltk.corpus import wordnet as wn
print(wn.readme(lang="eng"))

As mentioned in lecture, the main nodes in WordNet are synsets, not words. Given any word, we can access the relevant synsets using the synsets command, optionally limiting the results to a particular word category (n = noun, v = verb, a = adjective, r = adverb). For each noun synset of the word type "book", let's look at its definition, its lemmas, examples of its usage, and its hypernyms (often only one, but there can be several).

In [None]:
for synset in wn.synsets("book","n"):
    print(synset.name())
    print(synset.definition())
    print(synset.lemma_names())
    print(synset.examples())
    print(synset.hypernyms())
    print("-------")

We can see here why WordNet is sometimes seen as too fine-grained, particularly for word sense disambiguation: several of these senses are closely related to each other in meaning. WordNet does not distinguish between true homonyms and instances of polysemy. In any case, once we know its name, we can access a particular synset with the synset command and look at other relationships, such as hyponyms. Note that meronyms and holonyms come in three types (part, member, or substance), though we'll only look at part here.

In [None]:
print(wn.synset("book.n.02").hyponyms())
print(wn.synset("book.n.02").part_meronyms())
print(wn.synset("book.n.10").part_holonyms())  # "book" meaning a division of a text

Each synset has a set of lemmas associated with it. Since antonyms are often specific to the word form, they are defined on lemmas, not synsets. Another function, derivationally_related_forms, gives other lemmas which are related by derivational morphology, though its coverage is not comprehensive. Finally, lemmas have a count associated with them, derived from a sense-tagged corpus; these counts can be used to identify which senses of a word are more common.

In [None]:
print(wn.synsets("happy")[0])
print(wn.synsets("happy")[0].lemmas()[0].antonyms())
print(wn.synsets("happy")[0].lemmas()[0].derivationally_related_forms())
print(wn.synsets("happy")[0].lemmas()[0].count())
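
As a rough illustration of that last point, the sketch below ranks the senses of a word by the count of the matching lemma. The helper name rank_senses_by_count is hypothetical (it is not part of NLTK); it relies only on the lemmas(), name() and count() methods used above, so it accepts any objects exposing those methods.

```python
def rank_senses_by_count(word, synsets):
    """Return (count, synset) pairs, most frequent sense first.

    Hypothetical helper for illustration, not an NLTK function.
    """
    # WordNet lemma names use underscores instead of spaces.
    target = word.lower().replace(" ", "_")
    ranked = []
    for synset in synsets:
        # Only count the lemma whose surface form matches the query word,
        # not the other lemmas that happen to share the synset.
        count = sum(lemma.count() for lemma in synset.lemmas()
                    if lemma.name().lower() == target)
        ranked.append((count, synset))
    # Sort on the count only; the sort is stable, so ties keep their input order.
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return ranked
```

Calling rank_senses_by_count("book", wn.synsets("book", "n")) should then list the noun senses of "book" from most to least frequent in the tagged data; a count of zero means the sense was never observed there.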


All of the basic similarity measures mentioned in class (and several others) are available in the NLTK WordNet interface, as are other functions which are used in their derivation. For similarity metrics which require information content, we can load statistics from available corpora (the SemCor and Brown corpora are popular options).

In [None]:
from nltk.corpus import wordnet_ic

print(wn.synset("book.n.02").path_similarity(wn.synset("newspaper.n.03")))
print(wn.synset("book.n.02").wup_similarity(wn.synset("newspaper.n.03")))

semcor_ic = wordnet_ic.ic('ic-semcor.dat')

print(wn.synset("book.n.02").lin_similarity(wn.synset("newspaper.n.03"), semcor_ic))
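
To make the definitions concrete, here is a minimal sketch of three of these measures on a hand-built toy taxonomy. The hierarchy, the probabilities and the helper names are all made up for illustration. The formulas are the textbook ones: path similarity is 1 / (1 + edges on the shortest path), Wu-Palmer is 2 * depth(LCS) / (depth(a) + depth(b)), and Lin is 2 * IC(LCS) / (IC(a) + IC(b)) with IC = -log p. NLTK's implementations may differ in details such as how depth is counted.

```python
import math

# Toy single-parent taxonomy (child -> hypernym); all names are invented.
PARENT = {
    "artifact": "entity",
    "publication": "artifact",
    "book": "publication",
    "newspaper": "publication",
    "pizza": "entity",
}
# Made-up probabilities of encountering each concept (or one of its hyponyms).
PROB = {"entity": 1.0, "artifact": 0.5, "publication": 0.2,
        "book": 0.1, "newspaper": 0.05, "pizza": 0.01}

def ancestors(node):
    """Return [node, hypernym, ..., root]."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def depth(node):
    return len(ancestors(node))  # assumption: the root has depth 1

def lcs(a, b):
    """Lowest common subsumer: the first ancestor of a that also subsumes b."""
    b_chain = set(ancestors(b))
    return next(n for n in ancestors(a) if n in b_chain)

def path_sim(a, b):
    c = lcs(a, b)
    edges = (depth(a) - depth(c)) + (depth(b) - depth(c))
    return 1.0 / (1 + edges)

def wup_sim(a, b):
    return 2.0 * depth(lcs(a, b)) / (depth(a) + depth(b))

def ic(node):
    return -math.log(PROB[node])  # information content

def lin_sim(a, b):
    return 2.0 * ic(lcs(a, b)) / (ic(a) + ic(b))

print(path_sim("book", "newspaper"))
print(wup_sim("book", "newspaper"))
print(lin_sim("book", "newspaper"))
```

In this toy tree "book" and "newspaper" meet at "publication", whereas "book" and "pizza" only meet at the root, whose information content is zero, so their Lin similarity is zero as well.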


However, these built-in measures are somewhat opaque in their operation, and only work on synsets. Let's create a version of basic path distance which doesn't require you to select a specific synset in advance, and which shows you the exact path through the WordNet hierarchy that the score is based on. There are many ways to do this; we'll do it in a fairly clear but not entirely optimal way. First, given a set of synsets, let's get a dictionary where the keys correspond to all hypernym synsets, and the values are the next step below on the shortest path to one of the initial synsets.

In [None]:
def get_hypernym_path_dict(synsets):
    hypernym_dict = {}
    synsets_to_expand = synsets
    while synsets_to_expand:
        new_synsets_to_expand = set()
        for synset in synsets_to_expand:
            for hypernym in synset.hypernyms():
                if hypernym not in hypernym_dict:  # this ensures we get the shortest path
                    hypernym_dict[hypernym] = synset
                    new_synsets_to_expand.add(hypernym)
        synsets_to_expand = new_synsets_to_expand
    return hypernym_dict
        
        
hypernym_dict = get_hypernym_path_dict(wn.synsets("book","n"))
print(hypernym_dict)
    
    

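We also need a way to build the path using this information: starting from a hypernym, we follow the dictionary downwards until we reach one of the initial synsets.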

In [None]:
def get_path_using_hypernym_dict(hypernym,hypernym_dict,synsets):
    path = [hypernym]
    current_synset = hypernym_dict[hypernym]
    while current_synset not in synsets:
        path.append(current_synset)
        current_synset =  hypernym_dict[current_synset]
    path.append(current_synset)
    return path
    
print(get_path_using_hypernym_dict(wn.synset('physical_entity.n.01'), hypernym_dict, wn.synsets("book", "n")))
        

Now we can build ancestor dictionaries for each of the words, look at the intersection, and then find the shortest path.

In [None]:
def get_shortest_path_between(word1,word2):
    synsets1 = wn.synsets(word1)
    synsets2 = wn.synsets(word2)
    hypernym_dict1 = get_hypernym_path_dict(synsets1)
    hypernym_dict2 = get_hypernym_path_dict(synsets2)
    best_path = []
    for hypernym in hypernym_dict1:
        if hypernym in hypernym_dict2 and hypernym_dict1[hypernym] != hypernym_dict2[hypernym]:
            path1 = get_path_using_hypernym_dict(hypernym,hypernym_dict1,synsets1)
            path2 = get_path_using_hypernym_dict(hypernym,hypernym_dict2,synsets2)
            if not best_path or len(path1) + len(path2) - 1 < len(best_path):
                path1.reverse()
                best_path = path1 + path2[1:]
    return best_path

path = get_shortest_path_between("book","newspaper")
print(1.0/len(path))
print(path)

path = get_shortest_path_between("dog","cat")
print(1.0/len(path))
print(path)

path = get_shortest_path_between("nickel","money")
print(1.0/len(path))
print(path)

path = get_shortest_path_between("computer","pizza")
print(1.0/len(path))
print(path)

The shortest path does not always correspond to the most obvious relationship between two words: for instance, newspaper and book are joined as products (not reading materials), and dog and cat by informal senses related to people rather than animals. Using depth-based or information-content-based metrics can improve this situation. Another approach is to use the counts of lemmas to ignore rare senses. Note that doing all this for other metrics is somewhat different, because they are based on the idea of the lowest common subsumer, which is not necessarily on the shortest path.
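
The last point can be seen on a small made-up graph. In the sketch below (the node names and helper functions are hypothetical, and the graph is not taken from WordNet), x and y share a shallow hypernym one step up and a much deeper one two steps up, so the shortest path and the deepest common subsumer disagree. Depth is taken here as the longest route to the root, which is only one of several possible conventions.

```python
# Toy hypernym graph (node -> list of hypernyms); multiple inheritance is allowed.
HYPERNYMS = {
    "shallow": ["root"],
    "a": ["root"], "b": ["a"], "c": ["b"], "deep": ["c"],
    "mx": ["deep"], "my": ["deep"],
    "x": ["shallow", "mx"],
    "y": ["shallow", "my"],
}

def distances_up(node):
    """Map the node and each of its ancestors to the fewest hypernym edges needed to reach it."""
    dist = {node: 0}
    frontier = [node]
    while frontier:
        next_frontier = []
        for current in frontier:
            for parent in HYPERNYMS.get(current, []):
                if parent not in dist:  # breadth-first, so the first visit is the shortest
                    dist[parent] = dist[current] + 1
                    next_frontier.append(parent)
        frontier = next_frontier
    return dist

def max_depth(node):
    """Depth measured along the longest route to the root (the root has depth 1)."""
    parents = HYPERNYMS.get(node, [])
    return 1 if not parents else 1 + max(max_depth(p) for p in parents)

def common_subsumers(a, b):
    """Map each shared ancestor to the length in edges of the path through it."""
    up_a, up_b = distances_up(a), distances_up(b)
    return dict((n, up_a[n] + up_b[n]) for n in up_a if n in up_b)

common = common_subsumers("x", "y")
on_shortest_path = min(common, key=lambda n: common[n])
deepest = max(common, key=max_depth)
print("shortest path goes through %s (%d edges)" % (on_shortest_path, common[on_shortest_path]))
print("deepest common subsumer is %s (%d edges)" % (deepest, common[deepest]))
```

A path-based score would connect x and y through the shallow node, while a depth-based measure would pick the deeper one even though the route through it is longer.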