# WordNet
WordNet is a lexical database of semantic relations between words, with versions available for many languages. It groups words into synsets (synonym sets): collections of words with similar meanings, each of which carries a definition, usage examples, and lemmas.

In [74]:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('book')
from nltk.corpus import wordnet as wn

[nltk_data] Downloading package wordnet to C:\Users\Quang
[nltk_data]     Nguyen\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to C:\Users\Quang
[nltk_data]     Nguyen\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading collection 'book'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to C:\Users\Quang
[nltk_data]    |     Nguyen\AppData\Roaming\nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package brown to C:\Users\Quang
[nltk_data]    |     Nguyen\AppData\Roaming\nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package chat80 to C:\Users\Quang
[nltk_data]    |     Nguyen\AppData\Roaming\nltk_data...
[nltk_data]    |   Package chat80 is already up-to-date!
[nltk_data]    | Downloading package cmudict to C:\Users\Quang
[nltk_data]    |     Nguyen\App

## Nouns
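I selected a word that is commonly used as a noun, 'dress', and output all of its synsets. The result is a list of every synset that contains the selected word; note that it includes verb (`.v.`) and adjective (`.s.`) senses as well as noun (`.n.`) senses.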

In [75]:
# output all synsets for the word 'dress'
wn.synsets('dress')

[Synset('dress.n.01'),
 Synset('attire.n.01'),
 Synset('apparel.n.01'),
 Synset('dress.v.01'),
 Synset('dress.v.02'),
 Synset('dress.v.03'),
 Synset('dress.v.04'),
 Synset('preen.v.03'),
 Synset('dress.v.06'),
 Synset('dress.v.07'),
 Synset('trim.v.06'),
 Synset('dress.v.09'),
 Synset('dress.v.10'),
 Synset('snip.v.02'),
 Synset('dress.v.12'),
 Synset('dress.v.13'),
 Synset('dress.v.14'),
 Synset('dress.v.15'),
 Synset('dress.v.16'),
 Synset('full-dress.s.01'),
 Synset('dress.s.02')]

Let's now select one of the synsets from the list to extract its definition, usage examples, and lemmas.

In [76]:
# extract definition 
print('---- Definitions ----')
wn.synset('apparel.n.01').definition()

---- Definitions ----


'clothing in general'

In [77]:
# extract usage examples
wn.synset('apparel.n.01').examples()

['she was refined in her choice of apparel',
 'he always bought his clothes at the same store',
 'fastidious about his dress']

In [78]:
# extract lemmas
wn.synset('apparel.n.01').lemmas()

[Lemma('apparel.n.01.apparel'),
 Lemma('apparel.n.01.wearing_apparel'),
 Lemma('apparel.n.01.dress'),
 Lemma('apparel.n.01.clothes')]

It's also possible to walk up the hierarchy from the selected synset by following its hypernyms (more general terms).

In [79]:
# Traverse up the WordNet hierarchy

hy = wn.synset('apparel.n.01').hypernyms()[0]
# hierarchy for nouns has 'entity' at the top
top = wn.synset('entity.n.01')

while hy:
    print(hy)
    if hy == top:
        break
    if hy.hypernyms():
        # follow only the first hypernym at each level
        hy = hy.hypernyms()[0]
    else:
        # no parent left: stop instead of looping forever
        break


Synset('clothing.n.01')
Synset('consumer_goods.n.01')
Synset('commodity.n.01')
Synset('artifact.n.01')
Synset('whole.n.02')
Synset('object.n.01')
Synset('physical_entity.n.01')
Synset('entity.n.01')


The output shows that the noun hierarchy is set up so that each synset is a subclass of its parent (its hypernym). The chain continues until it reaches 'entity', the root that encompasses all nouns.
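The same walk can be sketched without NLTK to make the structure explicit. This is only an illustration: `FIRST_HYPERNYM` and `hypernym_chain` are names I made up, and the table simply copies the chain printed above (first hypernym only).

```python
# Toy hypernym table copied from the output above: each synset name maps to
# its first hypernym. 'entity.n.01' is the root, so it has no parent.
FIRST_HYPERNYM = {
    'apparel.n.01': 'clothing.n.01',
    'clothing.n.01': 'consumer_goods.n.01',
    'consumer_goods.n.01': 'commodity.n.01',
    'commodity.n.01': 'artifact.n.01',
    'artifact.n.01': 'whole.n.02',
    'whole.n.02': 'object.n.01',
    'object.n.01': 'physical_entity.n.01',
    'physical_entity.n.01': 'entity.n.01',
    'entity.n.01': None,
}


def hypernym_chain(name, table=FIRST_HYPERNYM):
    """Return the ancestors of `name`, nearest first, ending at the root."""
    chain = []
    parent = table.get(name)
    while parent is not None:
        chain.append(parent)
        parent = table.get(parent)
    return chain


chain = hypernym_chain('apparel.n.01')
print(' -> '.join(chain))
print('number of ancestors:', len(chain))
```

Because the loop stops as soon as a synset has no parent, it terminates even for a synset that never reaches 'entity'.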

The code below shows several ways to query hierarchical and lexical relations. A relation returns an empty list when no related synsets or lemmas exist.

In [80]:
# output hypernyms, hyponyms, meronyms, holonyms, antonyms (or an empty list if none exist)

# Hypernyms
print('Hypernyms: ', wn.synset('apparel.n.01').hypernyms())

# Hyponyms
print('Hyponyms: ',  wn.synset('apparel.n.01').hyponyms())

# Meronyms
print('Meronyms: ', wn.synset('apparel.n.01').part_meronyms(), wn.synset('apparel.n.01').substance_meronyms())

# Holonyms
print('Holonyms: ', wn.synset('apparel.n.01').member_holonyms())

# Antonyms (defined on lemmas rather than on synsets)
print('Antonyms: ', wn.synset('apparel.n.01').lemmas()[0].antonyms())

Hypernyms:  [Synset('clothing.n.01')]
Hyponyms:  [Synset('workwear.n.01')]
Meronyms:  [] []
Holonyms:  []
Antonyms:  []


## Verbs
Now let's try an example with verbs.

In [81]:
# output all synsets from verb
wn.synsets('sing')

[Synset('sing.v.01'),
 Synset('sing.v.02'),
 Synset('sing.v.03'),
 Synset('whistle.v.05'),
 Synset('spill_the_beans.v.01')]

Let's extract the definition, usage examples, and lemmas from one of the verb synsets listed above.

In [82]:
# extract definition 
wn.synset('sing.v.02').definition()

'produce tones with the voice'

In [83]:
# extract usage examples
wn.synset('sing.v.02').examples()

['She was singing while she was cooking', 'My brother sings very well']

In [84]:
# extract lemmas
wn.synset('sing.v.02').lemmas()

[Lemma('sing.v.02.sing')]

The code below walks up the hierarchy from the selected verb synset and collects every hypernym along the way.

In [85]:
# Traverse up the WordNet hierarchy
sing = wn.synset('sing.v.02')
hy = lambda s: s.hypernyms()
list(sing.closure(hy))

[Synset('talk.v.02'),
 Synset('communicate.v.02'),
 Synset('interact.v.01'),
 Synset('act.v.01')]

Verbs are organized differently from nouns in WordNet: there is no single hypernym shared by all verbs. Unlike noun hierarchies, which all end at 'entity', verb hierarchies terminate in different places.

## Morphy
The morphy() function returns the base form of the selected word, or None if it cannot find one for the given part of speech.

In [86]:
# use morphy to find the base form of a word, optionally for a given part of speech
print(wn.morphy('loving'))
print(wn.morphy('hugged', wn.VERB))
print(wn.morphy('hugged', wn.ADV))

love
hug
None
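To get a feel for how a base-form lookup like this can work, here is a toy sketch. As far as I understand, morphy combines a list of irregular forms with suffix rules whose results are checked against the WordNet vocabulary, but the code below is my own miniature and not NLTK's implementation: `TOY_LEXICON`, `TOY_EXCEPTIONS`, `VERB_RULES`, and `toy_morphy` are all hypothetical.

```python
# Hypothetical miniature of a base-form lookup for verbs.
TOY_LEXICON = {'love', 'hug', 'sing'}                # known base forms
TOY_EXCEPTIONS = {'hugged': 'hug', 'sang': 'sing'}   # forms the rules miss

# (suffix to remove, text to put in its place)
VERB_RULES = [('ies', 'y'), ('es', 'e'), ('es', ''), ('ed', 'e'),
              ('ed', ''), ('ing', 'e'), ('ing', ''), ('s', '')]


def toy_morphy(word):
    """Return a base form from the toy lexicon, or None if nothing matches."""
    if word in TOY_LEXICON:
        return word
    if word in TOY_EXCEPTIONS:
        return TOY_EXCEPTIONS[word]
    for suffix, replacement in VERB_RULES:
        if word.endswith(suffix):
            candidate = word[:-len(suffix)] + replacement
            # a candidate only counts if it is a known word
            if candidate in TOY_LEXICON:
                return candidate
    return None


for w in ['loving', 'hugged', 'sang', 'blorfed']:
    print(w, '->', toy_morphy(w))
```

Stripping 'ed' from 'hugged' gives 'hugge' or 'hugg', neither of which is a word, which is why the toy needs the exception table for it.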


## Wu-Palmer Similarity Metric and Lesk Algorithm
A similarity measure is often used to assess how closely related two words are in meaning. The score usually ranges from 0 (little similarity) to 1 (identity).

Below, I chose two words that I believe are similar to some degree to demonstrate the Wu-Palmer similarity metric, which is based on the depths of the two synsets in the hierarchy and the depth of their most specific common ancestor node.

In [87]:
# Wu-Palmer Similarity Metric
wn.wup_similarity(wn.synset('lady.n.01'), wn.synset('female.n.01'))

0.6666666666666666
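The Wu-Palmer idea can be sketched on a small hand-made taxonomy. This is a toy under stated assumptions: the `PARENT` tree is invented for illustration (it is not WordNet data), the root is given depth 1, and every node has a single parent. The score is 2 * depth(common ancestor) / (depth(a) + depth(b)).

```python
# Hypothetical single-parent taxonomy; None marks the root.
PARENT = {
    'entity': None,
    'animal': 'entity',
    'plant': 'entity',
    'mammal': 'animal',
    'bird': 'animal',
    'dog': 'mammal',
    'cat': 'mammal',
}


def path_to_root(node):
    """Return [node, parent, grandparent, ..., root]."""
    path = [node]
    while PARENT[node] is not None:
        node = PARENT[node]
        path.append(node)
    return path


def depth(node):
    # assumption: the root counts as depth 1
    return len(path_to_root(node))


def lowest_common_ancestor(a, b):
    """Return the deepest node that lies on both paths to the root."""
    ancestors_of_a = set(path_to_root(a))
    for node in path_to_root(b):  # nearest ancestors come first
        if node in ancestors_of_a:
            return node


def wup(a, b):
    lcs = lowest_common_ancestor(a, b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


print(wup('dog', 'cat'))    # share 'mammal'
print(wup('dog', 'bird'))   # share only 'animal'
print(wup('dog', 'plant'))  # share only the root
```

The deeper the shared ancestor sits relative to the two nodes, the closer the score gets to 1.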

Next, I try the Lesk algorithm, which returns the synset whose definition has the most words in common with a given context phrase. Optionally, a pos argument restricts the candidate synsets to one part of speech.

In [88]:
# Lesk Algorithm
from nltk.wsd import lesk

for ss in wn.synsets('hit'):
    print(ss, ss.definition())

Synset('hit.n.01') (baseball) a successful stroke in an athletic contest (especially in baseball)
Synset('hit.n.02') the act of contacting one thing with another
Synset('hit.n.03') a conspicuous success
Synset('collision.n.01') (physics) a brief event in which two or more bodies come together
Synset('hit.n.05') a dose of a narcotic drug
Synset('hit.n.06') a murder carried out by an underworld syndicate
Synset('hit.n.07') a connection made via the internet to another website
Synset('hit.v.01') cause to move by striking
Synset('hit.v.02') hit against; come into sudden contact with
Synset('hit.v.03') deal a blow to, either with the hand or with an instrument
Synset('reach.v.01') reach a destination, either real or abstract
Synset('hit.v.05') affect or afflict suddenly, usually adversely
Synset('shoot.v.01') hit with a missile from a weapon
Synset('stumble.v.03') encounter by chance
Synset('score.v.01') gain points in a game
Synset('hit.v.09') cause to experience suddenly
Synset('strike.v.

In [89]:
# example sentence
sentence = 'It was a musical hit'
words = sentence.split()

print(lesk(words, 'hit', 'n'))
print(lesk(words, 'hit'))

Synset('hit.n.07')
Synset('shoot.v.01')


In [90]:
# example sentence
sentence = 'Can I take a hit of that'
words = sentence.split()

print(lesk(words, 'hit', 'n'))
print(lesk(words, 'hit'))

Synset('hit.n.05')
Synset('shoot.v.01')


Specifying the pos makes it easier to find the right synset for a context sentence, because it narrows the candidates; without it, both sentences returned Synset('shoot.v.01'). Even with the pos, though, the algorithm is not always correct: for 'It was a musical hit' it returned Synset('hit.n.07'), 'a connection made via the internet to another website', which has nothing to do with my sentence.
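The word-overlap idea behind Lesk can be sketched in a few lines. This is a simplified toy, not NLTK's code: `SENSES`, `overlap`, and `simple_lesk` are hypothetical names, the three definitions are copied from the synset list printed above, and the first context sentence is one I made up.

```python
# Three noun senses of 'hit', with definitions copied from the output above.
SENSES = {
    'hit.n.03': 'a conspicuous success',
    'hit.n.05': 'a dose of a narcotic drug',
    'hit.n.07': 'a connection made via the internet to another website',
}


def overlap(context_words, definition):
    """Count the distinct words shared by the context and a definition."""
    return len(set(context_words) & set(definition.split()))


def simple_lesk(context_words, senses=SENSES):
    """Return the sense whose definition overlaps the context the most."""
    return max(senses, key=lambda name: overlap(context_words, senses[name]))


context = 'he took a dose of the drug'.split()
for name, definition in SENSES.items():
    print(name, overlap(context, definition))
print('best sense:', simple_lesk(context))

# For the sentence used earlier, every sense shares only the word 'a'.
musical = 'It was a musical hit'.split()
print({name: overlap(musical, d) for name, d in SENSES.items()})
```

In the second case none of the informative words in the sentence appear in any definition, so the scores are tied and the choice depends on function words and tie-breaking, which may help explain the unrelated result above.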

## SentiWordNet
SentiWordNet was designed for opinion mining: it assigns sentiment scores for positivity, negativity, and objectivity to each synset.

Each value is always between 0 and 1, and the sum of the three scores is 1.0.

I chose an emotionally charged word, and the code below gives the polarity scores for each of its synsets.

In [91]:
from nltk.corpus import sentiwordnet as swn

# an emotionally charged word
senti_synsets = list(swn.senti_synsets('expectation'))

# output the polarity scores for each of its senti-synsets
for ss in senti_synsets:
    print(ss)
    print('Positive Score: ', ss.pos_score())
    print('Negative Score: ', ss.neg_score())
    print('Objective Score: ', ss.obj_score(), '\n')


<expectation.n.01: PosScore=0.0 NegScore=0.0>
Positive Score:  0.0
Negative Score:  0.0
Objective Score:  1.0 

<anticipation.n.04: PosScore=0.5 NegScore=0.0>
Positive Score:  0.5
Negative Score:  0.0
Objective Score:  0.5 

<expectation.n.03: PosScore=0.0 NegScore=0.125>
Positive Score:  0.0
Negative Score:  0.125
Objective Score:  0.875 

<arithmetic_mean.n.01: PosScore=0.0 NegScore=0.0>
Positive Score:  0.0
Negative Score:  0.0
Objective Score:  1.0 



Now, let's get the polarity scores of each word in a sentence.

In [92]:
# variables
sentence = 'expecting to have a delivery today'
words = sentence.split()

print(words)
# output polarity for each word, using only its first senti-synset
for w in words:
    ss_list = list(swn.senti_synsets(w))

    # skip words that SentiWordNet has no entry for
    if ss_list:
        syn = ss_list[0]
        print(syn)
        print('Positive Score: ', syn.pos_score())
        print('Negative Score: ', syn.neg_score())
        print('Objective Score: ', syn.obj_score(), '\n')


['expecting', 'to', 'have', 'a', 'delivery', 'today']
<expect.v.01: PosScore=0.25 NegScore=0.25>
Positive Score:  0.25
Negative Score:  0.25
Objective Score:  0.5 

<rich_person.n.01: PosScore=0.0 NegScore=0.0>
Positive Score:  0.0
Negative Score:  0.0
Objective Score:  1.0 

<angstrom.n.01: PosScore=0.0 NegScore=0.0>
Positive Score:  0.0
Negative Score:  0.0
Objective Score:  1.0 

<delivery.n.01: PosScore=0.0 NegScore=0.0>
Positive Score:  0.0
Negative Score:  0.0
Objective Score:  1.0 

<today.n.01: PosScore=0.125 NegScore=0.0>
Positive Score:  0.125
Negative Score:  0.0
Objective Score:  0.875 



The output shows a polarity score for each word that SentiWordNet has an entry for. Words with no entry, such as 'to', are skipped. Some stopwords still match unrelated senses, though: 'have' returned rich_person.n.01 and 'a' returned angstrom.n.01, because the code takes the first synset without considering part of speech or context. NLP applications that need to analyze the sentiment behind a text can benefit from these scores, for example a program that needs to know whether a user is happy about a topic or dissatisfied with it.
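One simple way to turn the per-word scores into a sentence-level signal is to add them up. The sketch below is only an assumption about how an application might do this, not something SentiWordNet prescribes: `WORD_SCORES` copies the (positive, negative) pairs from the output above for the content words only, and `sentence_polarity` is a hypothetical helper.

```python
# (positive, negative) scores copied from the output above.
WORD_SCORES = {
    'expecting': (0.25, 0.25),   # expect.v.01
    'delivery': (0.0, 0.0),      # delivery.n.01
    'today': (0.125, 0.0),       # today.n.01
}


def sentence_polarity(scores):
    """Sum positive and negative scores and return (pos, neg, pos - neg)."""
    pos = sum(p for p, _ in scores.values())
    neg = sum(n for _, n in scores.values())
    return pos, neg, pos - neg


pos, neg, net = sentence_polarity(WORD_SCORES)
print('positive:', pos, 'negative:', neg, 'net:', net)

# The objective score is whatever is left over: 1 - (positive + negative).
objectivity = {w: 1 - (p + n) for w, (p, n) in WORD_SCORES.items()}
print(objectivity)
```

A net score above zero leans positive and one below zero leans negative; a real application would probably also want to pick the right sense for each word before adding anything up.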

## Collocation
A collocation is a combination of two or more words that occur together more often than chance would predict and that together carry a meaning beyond the individual words. For example, the collocation 'gap year' means more than the individual words can express.

In [111]:
# output collocations for text4, Inaugural Corpus
from nltk.book import text4
text4.collocations()

United States; fellow citizens; years ago; four years; Federal
Government; General Government; American people; Vice President; God
bless; Chief Justice; one another; fellow Americans; Old World;
Almighty God; Fellow citizens; Chief Magistrate; every citizen; Indian
tribes; public debt; foreign nations


Now, let's select one of the collocations identified by NLTK and calculate its mutual information.

Pointwise mutual information (PMI) is the log of the ratio between the probability of the two words occurring together and the product of their individual probabilities:

PMI(x, y) = log2( P(x,y) / [P(x) * P(y)] )

In [112]:
# calculate mutual information
import math

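# join the tokens into one string so that the two-word phrase can be counted
txt = ' '.join(text4.tokens)

# number of distinct tokens; used below as the denominator for each estimate
vocab = len(set(text4))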

# words
fellow = txt.count('fellow') / vocab
print('p(\'fellow\'): ', fellow)

citizens = txt.count('citizens') / vocab
print('p(\'citizens\'): ', citizens)

fellow_citizens = txt.count('fellow citizens') / vocab
print('p(\'fellow citizens\'): ', fellow_citizens)

# calculate
pmi = math.log2(fellow_citizens / (fellow * citizens))

print('Mutual Information Score: ', pmi)

p('fellow'):  0.013665835411471322
p('citizens'):  0.026932668329177057
p('fellow citizens'):  0.006084788029925187
Mutual Information Score:  4.0472042737811735


My Commentary on the Results of the Mutual Information Formula and my Interpretation

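The MI score measures how far the co-occurrence of the two words departs from chance. A positive score, like the roughly 4.05 obtained here, means 'fellow' and 'citizens' appear together more often than they would if they were independent. Mutual information is useful for judging how significant a collocation is in a particular text, and because the measure is symmetric in the two words, it suggests that they attract each other in both directions. Note that the code above estimates the probabilities by dividing substring counts by the vocabulary size rather than by the number of tokens, so the score is only a rough approximation.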
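For comparison, here is a sketch of the same formula using whole-token counts, with the number of tokens (and of adjacent pairs) as the denominators. It runs on a short made-up token list rather than on text4, so the number it prints is not comparable to the score above; `pmi` is a hypothetical helper of my own.

```python
import math
from collections import Counter


def pmi(tokens, x, y):
    """PMI of the adjacent pair (x, y), or None if the pair never occurs."""
    if len(tokens) < 2:
        return None
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))  # adjacent token pairs
    n_tokens = len(tokens)
    n_bigrams = n_tokens - 1

    p_x = unigrams[x] / n_tokens
    p_y = unigrams[y] / n_tokens
    p_xy = bigrams[(x, y)] / n_bigrams
    if p_xy == 0:
        return None  # log of zero is undefined
    return math.log2(p_xy / (p_x * p_y))


# Made-up example text, not taken from the Inaugural corpus.
tokens = ('my fellow citizens we thank our fellow citizens '
          'and all citizens').split()

score = pmi(tokens, 'fellow', 'citizens')
print('PMI(fellow, citizens):', score)
print('PMI(citizens, fellow):', pmi(tokens, 'citizens', 'fellow'))
```

Counting whole tokens avoids matching 'fellow' inside longer words, which substring counting on a joined string cannot rule out. Passing the token list of a real text to the same function should work the same way, although I have not run it on text4 here.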