# WordNet

One of the earliest attempts to create useful representations of meaning for language is [WordNet](https://en.wikipedia.org/wiki/WordNet) -- a lexical database of words and their relationships.

NLTK provides a [WordNet wrapper](http://www.nltk.org/howto/wordnet.html) that we'll use here.

Documentation and Examples: https://www.nltk.org/howto/wordnet.html

In [None]:
import nltk
assert nltk.download('wordnet')  # Make sure we have the WordNet data.
from nltk.corpus import wordnet as wn

[nltk_data] Downloading package wordnet to /root/nltk_data...


## Synsets
The fundamental WordNet unit is a **synset** (a set of synonyms that share one sense), identified by a word form, a part of speech, and a sense index, as in `dog.n.01`. The `wn.synsets()` function retrieves every synset that matches a given word. For example, "surf" has four synsets: one noun (`n`) and three verbs (`v`). WordNet provides a definition (gloss) for each synset and, for some, example sentences. A word with multiple senses, like "surf", is said to exhibit **polysemy**.
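
To make the naming scheme concrete, here is a minimal sketch that splits a synset name into its three parts. `SynsetName` and `parse_synset_name` are hypothetical helpers written for this illustration; they are not part of NLTK.

```python
from typing import NamedTuple


class SynsetName(NamedTuple):
    """The three parts of a synset name such as 'dog.n.01'."""
    word: str
    pos: str
    index: int


def parse_synset_name(name: str) -> SynsetName:
    # Split from the right so a word form that itself contains a dot stays intact.
    word, pos, index = name.rsplit(".", 2)
    return SynsetName(word, pos, int(index))


parsed = parse_synset_name("dog.n.01")
print(parsed.word, parsed.pos, parsed.index)
```

The part-of-speech field is a single letter, such as `n` for nouns and `v` for verbs, and the index distinguishes senses that share a word form and part of speech.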

In [None]:
wn.synset('dog.n.01')

Synset('dog.n.01')


For this particular synset we can fetch the definition:


In [None]:
print(wn.synset('dog.n.01').definition())

a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds


In [None]:
# example
print(wn.synset('dog.n.01').examples()[0])

the dog barked all night


In [None]:
for s in wn.synsets('surf'):
    print(s)
    print('\t', s.definition())
    print('\t', s.examples())

Synset('surf.n.01')
	 waves breaking on the shore
	 []
Synset('surfboard.v.01')
	 ride the waves of the sea with a surfboard
	 ['Californians love to surf']
Synset('browse.v.03')
	 look around casually and randomly, without seeking anything in particular
	 ['browse a computer directory', 'surf the internet or the world wide web']
Synset('surf.v.03')
	 switch channels, on television
	 []


### Synsets as objects


In [None]:
dog = wn.synset('dog.n.01')

In [None]:
dog.hypernyms()

[Synset('canine.n.02'), Synset('domestic_animal.n.01')]

In [None]:
dog.hyponyms()

[Synset('basenji.n.01'),
 Synset('corgi.n.01'),
 Synset('cur.n.01'),
 Synset('dalmatian.n.02'),
 Synset('great_pyrenees.n.01'),
 Synset('griffon.n.02'),
 Synset('hunting_dog.n.01'),
 Synset('lapdog.n.01'),
 Synset('leonberg.n.01'),
 Synset('mexican_hairless.n.01'),
 Synset('newfoundland.n.01'),
 Synset('pooch.n.01'),
 Synset('poodle.n.01'),
 Synset('pug.n.01'),
 Synset('puppy.n.01'),
 Synset('spitz.n.01'),
 Synset('toy_dog.n.01'),
 Synset('working_dog.n.01')]

In [None]:
dog.member_holonyms()

[Synset('canis.n.01'), Synset('pack.n.06')]

In [None]:
dog.root_hypernyms()

[Synset('entity.n.01')]

In [None]:
wn.synset('dog.n.01').lowest_common_hypernyms(wn.synset('cat.n.01'))

[Synset('carnivore.n.01')]

## Lemmas and synonyms
Each synset includes its corresponding **lemmas** (word forms).

We can construct a set of synonyms by looking up all the lemmas for all the synsets for a word.

In [None]:
synonyms = set()

for s in wn.synsets('triumphant'):
    for l in s.lemmas():
        synonyms.add(l.name())

print('synonyms:', ', '.join(synonyms))

synonyms: exulting, prideful, jubilant, victorious, triumphal, exultant, triumphant, rejoicing


In [None]:
wn.synset('dog.n.01').lemmas()

[Lemma('dog.n.01.dog'),
 Lemma('dog.n.01.domestic_dog'),
 Lemma('dog.n.01.Canis_familiaris')]

In [None]:
# Collect the names of all lemmas into a list (name() already returns a str).
dog_lemmas = [lemma.name() for lemma in wn.synset('dog.n.01').lemmas()]
dog_lemmas

['dog', 'domestic_dog', 'Canis_familiaris']

## Word hierarchies

WordNet organizes nouns and verbs into hierarchies according to hypernym (is-a) relationships.

Let's examine the path from "rutabaga" to its root in the tree, "entity".

In [None]:
s = wn.synsets('rutabaga')

# Follow only the first hypernym at each level until none are left.
while s:
    print(s[0].hypernyms())
    s = s[0].hypernyms()

[Synset('turnip.n.02')]
[Synset('cruciferous_vegetable.n.01'), Synset('root_vegetable.n.01')]
[Synset('vegetable.n.01')]
[Synset('produce.n.01')]
[Synset('food.n.02')]
[Synset('solid.n.01')]
[Synset('matter.n.03')]
[Synset('physical_entity.n.01')]
[Synset('entity.n.01')]
[]


The loop above follows only the first hypernym at each step. The more complete way to do this is with a transitive closure, which repeatedly applies the specified function (in this case, `hypernyms()`) and therefore covers every branch.

In [None]:
# closure() repeatedly applies the given function, here hypernyms().
hyper = lambda x: x.hypernyms()
s = wn.synset('rutabaga.n.01')
for i in s.closure(hyper):
    print(i)

ss = wn.synset('root_vegetable.n.01')
for i in ss.closure(hyper):
    print(i)

Synset('turnip.n.02')
Synset('cruciferous_vegetable.n.01')
Synset('root_vegetable.n.01')
Synset('vegetable.n.01')
Synset('vegetable.n.01')
Synset('produce.n.01')
Synset('produce.n.01')
Synset('food.n.02')
Synset('food.n.02')
Synset('solid.n.01')
Synset('solid.n.01')
Synset('matter.n.03')
Synset('matter.n.03')
Synset('physical_entity.n.01')
Synset('physical_entity.n.01')
Synset('entity.n.01')
Synset('entity.n.01')
Synset('vegetable.n.01')
Synset('produce.n.01')
Synset('food.n.02')
Synset('solid.n.01')
Synset('matter.n.03')
Synset('physical_entity.n.01')
Synset('entity.n.01')
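
To show what a transitive closure does without needing the WordNet data, here is a minimal sketch over a toy graph. The `HYPERNYMS` dictionary is hand-copied from the rutabaga listing above with the `.n.01` style suffixes dropped, and the `closure` function is a hypothetical stand-in written for this illustration, not NLTK's implementation. Unlike the output shown above, this sketch reports each ancestor only once.

```python
from collections import deque

# Toy hypernym graph (an assumption for illustration): child -> list of parents.
HYPERNYMS = {
    "rutabaga": ["turnip"],
    "turnip": ["cruciferous_vegetable", "root_vegetable"],
    "cruciferous_vegetable": ["vegetable"],
    "root_vegetable": ["vegetable"],
    "vegetable": ["produce"],
    "produce": ["food"],
    "food": ["solid"],
    "solid": ["matter"],
    "matter": ["physical_entity"],
    "physical_entity": ["entity"],
    "entity": [],
}


def closure(start, rel):
    """Yield every node reachable from start by repeatedly applying rel.

    Traversal is breadth-first and each node is yielded at most once.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for parent in rel(node):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
                yield parent


ancestors = list(closure("rutabaga", lambda n: HYPERNYMS[n]))
for name in ancestors:
    print(name)
```

Both parents of `turnip` are visited here, which is exactly what the earlier first-hypernym loop misses.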


## Measuring similarity

WordNet's word hierarchies (for nouns and verbs) allow us to measure similarity in various ways.

Path similarity is defined as:

> $1 / (ShortestPathDistance(s_1, s_2) + 1)$

where $ShortestPathDistance(s_1, s_2)$ is computed from the hypernym/hyponym graph.

In [None]:
dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
dog.path_similarity(cat)

0.2
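
The score of 0.2 corresponds to a shortest path of 4 edges, since 1 / (4 + 1) = 0.2. The following sketch reproduces that arithmetic on a simplified toy fragment of the hierarchy. The `HYPERNYMS` dictionary keeps only the canine and feline branches and is an assumption for illustration; `shortest_path_distance` and `path_similarity` here are plain-Python stand-ins, not the NLTK methods of the same name.

```python
from collections import deque

# Simplified toy fragment (an assumption): child -> list of parents.
HYPERNYMS = {
    "dog": ["canine"],
    "canine": ["carnivore"],
    "cat": ["feline"],
    "feline": ["carnivore"],
    "carnivore": [],
}


def shortest_path_distance(a, b, hypernyms):
    """Number of edges on the shortest hypernym/hyponym path between a and b."""
    # Treat every hypernym link as an undirected edge.
    neighbours = {}
    for child, parents in hypernyms.items():
        neighbours.setdefault(child, set())
        for parent in parents:
            neighbours[child].add(parent)
            neighbours.setdefault(parent, set()).add(child)

    # Standard breadth-first search from a.
    dist = {a: 0}
    queue = deque([a])
    while queue:
        node = queue.popleft()
        if node == b:
            return dist[node]
        for nxt in neighbours[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return None  # no connecting path


def path_similarity(a, b, hypernyms):
    d = shortest_path_distance(a, b, hypernyms)
    return None if d is None else 1 / (d + 1)


print(shortest_path_distance("dog", "cat", HYPERNYMS))
print(path_similarity("dog", "cat", HYPERNYMS))
```

A synset compared with itself has distance 0, so its similarity is 1.0, the maximum value.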

In [None]:
hit = wn.synset('hit.v.01')
slap = wn.synset('slap.v.01')
wn.path_similarity(hit, slap)

0.14285714285714285

In [None]:
s1 = wn.synset('dog.n.01')
s2 = wn.synset('cat.n.01')
s3 = wn.synset('potato.n.01')

print(s1, '::', s1, s1.path_similarity(s1))
print(s1, '::', s2, s1.path_similarity(s2))
print(s1, '::', s3, s1.path_similarity(s3))
print(s2, '::', s3, s2.path_similarity(s3))

hyper = lambda x: x.hypernyms()
print(s1.hypernyms())

for i in s1.closure(hyper):
    print(i)

Synset('dog.n.01') :: Synset('dog.n.01') 1.0
Synset('dog.n.01') :: Synset('cat.n.01') 0.2
Synset('dog.n.01') :: Synset('potato.n.01') 0.07142857142857142
Synset('cat.n.01') :: Synset('potato.n.01') 0.05263157894736842
[Synset('canine.n.02'), Synset('domestic_animal.n.01')]
Synset('canine.n.02')
Synset('domestic_animal.n.01')
Synset('carnivore.n.01')
Synset('animal.n.01')
Synset('placental.n.01')
Synset('organism.n.01')
Synset('mammal.n.01')
Synset('living_thing.n.01')
Synset('vertebrate.n.01')
Synset('whole.n.02')
Synset('chordate.n.01')
Synset('object.n.01')
Synset('physical_entity.n.01')
Synset('entity.n.01')


### Leacock-Chodorow Similarity
`synset1.lch_similarity(synset2)`: Leacock-Chodorow Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses (as above) and the maximum depth of the taxonomy in which the senses occur.

The relationship is given as `-log(p / (2d))`, where `p` is the shortest path length and `d` is the taxonomy depth.

In [None]:
dog.lch_similarity(cat)

2.0281482472922856
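
As a worked check of the formula, the sketch below plugs in numbers by hand. The inputs are assumptions: `p = 5` takes the dog/cat path as 4 edges (implied by the path similarity of 0.2) and counts it in nodes, and `d = 19` is a noun taxonomy depth chosen because, together, these values reproduce the score above. `lch_score` is a hypothetical helper, not an NLTK function.

```python
import math


def lch_score(path_length, taxonomy_depth):
    """Leacock-Chodorow score: -log(p / (2d))."""
    return -math.log(path_length / (2 * taxonomy_depth))


# Assumed inputs: path counted as 5 nodes, taxonomy depth 19.
score = lch_score(5, 19)
print(score)
```

Because the score is a negative log of a ratio, shorter paths give larger values, and the result is not bounded to the range 0 to 1 the way path similarity is.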

### Wu-Palmer Similarity

`synset1.wup_similarity(synset2)`: Wu-Palmer Similarity: Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (LCS, the most specific ancestor node). Note that at this time the scores given do not always agree with those given by Pedersen’s Perl implementation of WordNet::Similarity.

The LCS does not necessarily feature in the shortest path connecting the two senses, as it is by definition the common ancestor deepest in the taxonomy, not closest to the two senses. Typically, however, it will so feature. Where multiple candidates for the LCS exist, that whose shortest path to the root node is the longest will be selected. Where the LCS has multiple paths to the root, the longer path is used for the purposes of the calculation.

In [None]:
dog.wup_similarity(cat)

0.8571428571428571

In [None]:
hit.wup_similarity(slap)

0.25
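
The usual statement of the Wu-Palmer measure is 2 · depth(LCS) / (depth(s1) + depth(s2)). Here is a minimal sketch of that formula on a toy taxonomy. The `HYPERNYMS` dictionary and the helper functions are assumptions made for illustration and do not reproduce NLTK's exact procedure; depths are counted in nodes, so a root has depth 1.

```python
# Toy taxonomy (an assumption): child -> list of parents.
HYPERNYMS = {
    "entity": [],
    "animal": ["entity"],
    "carnivore": ["animal"],
    "canine": ["carnivore"],
    "feline": ["carnivore"],
    "dog": ["canine"],
    "cat": ["feline"],
}


def depth(node, hypernyms):
    """Length in nodes of the longest path from node up to a root."""
    parents = hypernyms[node]
    if not parents:
        return 1
    return 1 + max(depth(p, hypernyms) for p in parents)


def ancestors(node, hypernyms):
    """The node itself plus everything reachable through hypernym links."""
    found = {node}
    stack = [node]
    while stack:
        for parent in hypernyms[stack.pop()]:
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def wup_similarity(a, b, hypernyms):
    common = ancestors(a, hypernyms) & ancestors(b, hypernyms)
    if not common:
        return None
    # The least common subsumer is the deepest shared ancestor.
    lcs = max(common, key=lambda n: depth(n, hypernyms))
    return 2 * depth(lcs, hypernyms) / (depth(a, hypernyms) + depth(b, hypernyms))


print(wup_similarity("dog", "cat", HYPERNYMS))
```

In this toy tree the subsumer `carnivore` has depth 3 and both senses have depth 5, giving 6 / 10 = 0.6. The dog/cat score of 0.857 shown above equals 6/7, which the same formula yields when, for example, the subsumer sits at depth 12 and each sense at depth 14 (illustrative values, not computed here).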

### Information Content
Some similarity measures need an information content (IC) file. Precomputed files can be loaded from the `wordnet_ic` corpus.

In [None]:
nltk.download('wordnet_ic')

[nltk_data] Downloading package wordnet_ic to /root/nltk_data...
[nltk_data]   Package wordnet_ic is already up-to-date!


True

In [None]:
from nltk.corpus import wordnet_ic
brown_ic = wordnet_ic.ic('ic-brown.dat')
semcor_ic = wordnet_ic.ic('ic-semcor.dat')

Or you can create an information content dictionary from a corpus (or anything that has a `words()` method).

In [None]:
nltk.download('genesis')

[nltk_data] Downloading package genesis to /root/nltk_data...
[nltk_data]   Package genesis is already up-to-date!


True

In [None]:
from nltk.corpus import genesis
genesis_ic = wn.ic(genesis, False, 0.0)

In [None]:
genesis_ic

### Resnik Similarity
`synset1.res_similarity(synset2, ic)`: Resnik Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node). Note that for any similarity measure that uses information content, the result is dependent on the corpus used to generate the information content and the specifics of how the information content was created.


In [None]:
dog.res_similarity(cat, brown_ic)

7.911666509036577

In [None]:
dog.res_similarity(cat, genesis_ic)

7.204023991374833
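
The two scores differ because each corpus assigns different frequencies to the same concepts. The sketch below illustrates the idea with the common definition IC(c) = -log P(c), where P(c) is the share of corpus occurrences that fall under concept c. The toy taxonomy, the word counts, and all function names are hypothetical; NLTK's IC files are built with details (such as smoothing) that this sketch does not attempt to reproduce.

```python
import math

# Toy taxonomy (an assumption): child -> list of parents.
HYPERNYMS = {
    "entity": [],
    "animal": ["entity"],
    "carnivore": ["animal"],
    "dog": ["carnivore"],
    "cat": ["carnivore"],
    "vegetable": ["entity"],
    "potato": ["vegetable"],
}

# Hypothetical corpus frequencies for the leaf concepts.
WORD_COUNTS = {"dog": 30, "cat": 20, "potato": 50}


def ancestors(node):
    """The node itself plus all of its hypernyms, transitively."""
    found = {node}
    stack = [node]
    while stack:
        for parent in HYPERNYMS[stack.pop()]:
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def concept_counts(word_counts):
    """Each occurrence of a word also counts toward all of its ancestors."""
    totals = {}
    for word, n in word_counts.items():
        for concept in ancestors(word):
            totals[concept] = totals.get(concept, 0) + n
    return totals


def information_content(concept, totals, root="entity"):
    return -math.log(totals[concept] / totals[root])


def res_similarity(a, b, totals):
    """Resnik: IC of the most informative common ancestor."""
    common = ancestors(a) & ancestors(b)
    return max(information_content(c, totals) for c in common)


totals = concept_counts(WORD_COUNTS)
print(res_similarity("dog", "cat", totals))
print(res_similarity("dog", "potato", totals))
```

Here `dog` and `cat` share `carnivore`, which covers half of the toy corpus, so their score is log 2. `dog` and `potato` share only the root, which covers everything and therefore carries no information.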

## Multilingual functions


The current version of WordNet in NLTK is multilingual.

The WordNet corpus reader gives access to the Open Multilingual WordNet, using ISO-639 language codes. These languages are not loaded by default, but only lazily, when needed.


To see which languages are supported, use this command:


In [None]:
sorted(wn.langs())

In [None]:
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [None]:
sorted(wn.langs())

['als',
 'arb',
 'bul',
 'cat',
 'cmn',
 'dan',
 'ell',
 'eng',
 'eus',
 'fin',
 'fra',
 'glg',
 'heb',
 'hrv',
 'ind',
 'isl',
 'ita',
 'ita_iwn',
 'jpn',
 'lit',
 'nld',
 'nno',
 'nob',
 'pol',
 'por',
 'ron',
 'slk',
 'slv',
 'spa',
 'swe',
 'tha',
 'zsm']

In [None]:
wn.synset('spy.n.01').lemma_names('cat')  # 'cat' is the ISO-639 code for Catalan

['agent_secret', 'espia']

In [None]:
wn.synset('dog.n.01').lemma_names('ita')

['Canis_familiaris', 'cane']

In [None]:
wn.synset('dog.n.01').lemmas('por')

[Lemma('dog.n.01.cachorra'),
 Lemma('dog.n.01.cachorro'),
 Lemma('dog.n.01.cadela'),
 Lemma('dog.n.01.cão')]