# Summary of WordNet

[WordNet](https://wordnet.princeton.edu/) is a project that provides a large lexical database of English nouns, verbs, adjectives, and adverbs. Words are grouped into synsets, which are sets of "cognitive synonyms" [1]. WordNet also records the hierarchical relationships between these synsets, which makes it a powerful tool for natural language processing (NLP) applications.



---


**References**

[1] Princeton University "About WordNet." WordNet. Princeton University. 2010.

In [None]:
import nltk
from nltk.corpus import wordnet as wn
from nltk.corpus import sentiwordnet as swn
from nltk.wsd import lesk
from nltk.book import *
import math

In [None]:
# downloads

nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('sentiwordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('gutenberg')
nltk.download('genesis')
nltk.download('inaugural')
nltk.download('nps_chat')
nltk.download('webtext')
nltk.download('treebank')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package sentiwordnet to /root/nltk_data...
[nltk_data]   Package sentiwordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package genesis to /root/nltk_data...
[nltk_data]   Package genesis is already up-to-date!
[nltk_data] Downloading package inaugural to /root/nltk_data...
[nltk_data]   Package inaugural is already up-to

True

In [None]:
# get all synsets of 'painting'
wn.synsets('painting', pos=wn.NOUN)

[Synset('painting.n.01'),
 Synset('painting.n.02'),
 Synset('painting.n.03'),
 Synset('painting.n.04')]

In [None]:
# get the definition of the first noun synset
wn.synset('painting.n.01').definition()

'graphic art consisting of an artistic composition made by applying paints to a surface'

In [None]:
# extract usage examples
wn.synset('painting.n.01').examples()

['a small painting by Picasso',
 'he bought the painting as an investment',
 'his pictures hang in the Louvre']

In [None]:
# extract lemmas
wn.synset('painting.n.01').lemmas()

[Lemma('painting.n.01.painting'), Lemma('painting.n.01.picture')]

In [None]:
# Traverse up the WordNet hierarchy 
hyper = lambda s: s.hypernyms()
list(wn.synset('painting.n.01').closure(hyper))

[Synset('graphic_art.n.01'),
 Synset('art.n.01'),
 Synset('creation.n.02'),
 Synset('artifact.n.01'),
 Synset('whole.n.02'),
 Synset('object.n.01'),
 Synset('physical_entity.n.01'),
 Synset('entity.n.01')]

# Analysis on how WordNet organizes nouns
Nouns are organized in a hierarchical structure with the synset 'entity.n.01' at the top, analogous to how many object-oriented programming languages employ a super-type 'Object' to serve as the parent of all other types. 

Nouns are grouped into synsets, each of which represents a group of words that share the same meaning and part of speech (POS). A single word form can belong to several synsets, one for each of its meanings.
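The single-rooted structure can be illustrated without WordNet installed. The following is a minimal sketch that hard-codes the hypernym chain printed above for `painting.n.01` as a child-to-parent dictionary. The names `TOY_HYPERNYMS` and `hypernym_chain` are made up for this illustration, and the sketch assumes one hypernym per synset, which real WordNet does not guarantee.

```python
# Toy child -> parent map, copied from the closure output shown earlier.
# Assumption: each synset has exactly one hypernym (not always true in WordNet).
TOY_HYPERNYMS = {
    "painting.n.01": "graphic_art.n.01",
    "graphic_art.n.01": "art.n.01",
    "art.n.01": "creation.n.02",
    "creation.n.02": "artifact.n.01",
    "artifact.n.01": "whole.n.02",
    "whole.n.02": "object.n.01",
    "object.n.01": "physical_entity.n.01",
    "physical_entity.n.01": "entity.n.01",
}


def hypernym_chain(name):
    """Follow parent links from `name` until a node with no parent is reached."""
    chain = []
    while name in TOY_HYPERNYMS:
        name = TOY_HYPERNYMS[name]
        chain.append(name)
    return chain


chain = hypernym_chain("painting.n.01")
print(chain[-1])   # the root of the toy hierarchy
print(len(chain))  # number of ancestors above painting.n.01
```

Every walk in this toy map ends at the same node, which is the behaviour the 'Object' analogy describes.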

In [None]:
# output hypernyms (higher)
hyper = [h.name() for h in wn.synset('painting.n.01').hypernyms()]
print("Hypernyms:", hyper)

# output hyponyms (lower)
hypo = [h.name() for h in wn.synset('painting.n.01').hyponyms()]
print("Hyponyms:", hypo)

# output meronyms (part of)
mero = [h.name() for h in wn.synset('painting.n.01').part_meronyms()]
print("Meronyms:", mero)

# output holonyms (whole)
holo = [h.name() for h in wn.synset('painting.n.01').part_holonyms()]
print("Holonyms:", holo)

# output antonyms
ant = []
for lemma in wn.synset('painting.n.01').lemmas():
  for antonym in lemma.antonyms():
    ant.append(antonym.name())
    
print("Antonyms:", ant)

Hypernyms: ['graphic_art.n.01']
Hyponyms: ['abstraction.n.04', 'cityscape.n.02', 'daub.n.03', 'distemper.n.04', 'finger-painting.n.01', 'icon.n.03', 'landscape.n.02', 'miniature.n.01', 'monochrome.n.01', 'mural.n.01', 'nude.n.01', 'oil_painting.n.01', 'pentimento.n.01', 'sand_painting.n.01', 'seascape.n.02', 'semi-abstraction.n.01', 'still_life.n.01', 'tanka.n.02', "trompe_l'oeil.n.01", 'watercolor.n.01']
Meronyms: []
Holonyms: []
Antonyms: []


In [None]:
# get all synsets of 'paint' verb
wn.synsets('painting', pos=wn.VERB)

[Synset('paint.v.01'),
 Synset('paint.v.02'),
 Synset('paint.v.03'),
 Synset('paint.v.04')]

In [None]:
print(wn.synsets('painting', pos=wn.VERB)[0])

# extract definition
definition = wn.synset('paint.v.01').definition()
print("Definition:", definition)

# extract usage examples
usage_examples = wn.synset('paint.v.01').examples()
print("Usage examples:", usage_examples)

# extract lemmas
lemmas = wn.synset('paint.v.01').lemmas()
print("Lemmas:", lemmas)

Synset('paint.v.01')
Definition: make a painting
Usage examples: ['he painted all day in the garden', 'He painted a painting of the garden']
Lemmas: [Lemma('paint.v.01.paint')]


In [None]:
# traverse up hierarchy
hyp = lambda s: s.hypernyms()
hierarchy = list(wn.synset('paint.v.01').closure(hyp))
print("Hierarchy of 'paint.v.01':", hierarchy)

Hierarchy of 'paint.v.01': [Synset('create.v.03'), Synset('act.v.01')]


# Analysis on how WordNet is organized for verbs
As opposed to the hierarchy of nouns in WordNet, verbs do not have a top level synset. 

Verb synsets are also named after the base form (lemma) of the verb rather than an inflected form. For example, `wn.synsets('painting', pos=wn.VERB)` returned synsets of the form 'paint.v.XX'.

In [None]:
# collect derivationally related forms of a word that share its Porter stem
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

word = "remembering"
synsets = wn.synsets(word)

s = set()
for synset in synsets:
  lemmas = synset.lemmas()
  for lemma in lemmas:
    forms = lemma.derivationally_related_forms()
    for form in forms:
      name = form.name()
      if stemmer.stem(name) == stemmer.stem(word):
        s.add(name)

print(s)

{'remembering', 'remember'}


In [None]:
# Select two words that you think might be similar. Find the specific synsets you are interested in.
word1 = wn.synsets("goal")[0]
word2 = wn.synsets("aim")[1]

print(word1, ":", word1.definition())
print(word2, ":", word2.definition())

# Wu-Palmer similarity Metric
print("Wu-Palmer Similarity:", wn.wup_similarity(word1, word2))

# Run the Lesk algorithm
print("\nRunning the Lesk algorithm")
sent1 = ["The", "goal", "is", "to", "pass", "the", "class"]
sent2 = ["The", "aim", "is", "to", "pass", "the", "class"]
sent3 = ["He", "took", "aim", "and", "fired"]

print(lesk(sent1, "goal", "n"))
print(lesk(sent2, "aim", "n"))
print(lesk(sent3, "aim", "n"))
print()

print("synset('purpose.n.01'):", wn.synset('purpose.n.01').definition())

Synset('goal.n.01') : the state of affairs that a plan is intended to achieve and that (when achieved) terminates behavior intended to achieve it
Synset('aim.n.02') : the goal intended to be attained (and which is believed to be attainable)
Wu-Palmer Similarity: 0.9230769230769231

Running the Lesk algorithm
Synset('goal.n.01')
Synset('aim.n.02')
Synset('purpose.n.01')

synset('purpose.n.01'): an anticipated outcome that is intended or that guides your planned actions


# My Observations on the Wu-Palmer metric and the Lesk Algorithm
I used the words 'goal' and 'aim' to see if the multiple meanings of the word 'aim' would confuse either of these algorithms. 

Specifically, I used the following synsets:
- Synset('goal.n.01') : the state of affairs that a plan is intended to achieve and that (when achieved) terminates behavior intended to achieve it
- Synset('aim.n.02') : the goal intended to be attained (and which is believed to be attainable)


The Wu-Palmer similarity metric returned `0.9230769230769231` on a scale where 1 means identical, so it correctly reflected how close the two synsets are.
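To make the score easier to interpret, here is a minimal sketch of the usual Wu-Palmer formulation, `2 * depth(LCS) / (depth(a) + depth(b))`, where LCS is the deepest ancestor shared by both nodes. It runs on a hand-built toy taxonomy (`PARENT`, with illustrative names that are not WordNet synsets) and does not reproduce the extra handling NLTK applies to synsets with several hypernym paths.

```python
# Toy taxonomy as child -> parent; names are illustrative, not WordNet synsets.
PARENT = {
    "animal": "entity",
    "mammal": "animal",
    "dog": "mammal",
    "cat": "mammal",
}


def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))


def wu_palmer(a, b):
    """2 * depth(LCS) / (depth(a) + depth(b)), LCS = deepest shared ancestor."""
    ancestors_of_a = set(path_to_root(a))
    # path_to_root(b) runs from b upward, so the first hit is the deepest one
    lcs = next(n for n in path_to_root(b) if n in ancestors_of_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))


print(wu_palmer("dog", "cat"))     # siblings under "mammal"
print(wu_palmer("mammal", "dog"))  # parent and child
print(wu_palmer("dog", "dog"))     # identical nodes
```

In this toy tree, scores approach 1 when two nodes are deep in the hierarchy and their shared ancestor is almost as deep as they are.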

The Lesk algorithm, however, does not take a synset as input. It takes a context sentence and an ambiguous word, and returns the synset it considers most likely. I tried to confuse it by using the word 'aim', whose meaning depends on the context. In the sentences "The goal/aim is to pass the class", the Lesk algorithm correctly identified the synsets I intended. In the sentence "He took aim and fired", however, it returned `synset('purpose.n.01'): an anticipated outcome that is intended or that guides your planned actions`, which was not the intended meaning.
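The core idea behind Lesk is gloss overlap: pick the sense whose definition shares the most words with the context. The sketch below is a simplified version of that idea, not NLTK's implementation. Two of the glosses in `GLOSSES` are the definitions printed above; the third, `aim_as_pointing`, is a hypothetical gloss written for this example, and the sense labels themselves are invented.

```python
# Glosses for two senses are the definitions printed above; the third
# ("aim_as_pointing") is a hypothetical gloss written for this sketch.
GLOSSES = {
    "aim_as_goal": "the goal intended to be attained (and which is believed to be attainable)",
    "aim_as_purpose": "an anticipated outcome that is intended or that guides your planned actions",
    "aim_as_pointing": "the action of directing something at a target",
}


def normalize(words):
    """Lowercase each word and strip surrounding punctuation."""
    return {w.strip("().,").lower() for w in words}


def simplified_lesk(context_tokens):
    """Return (best_sense, overlaps), scoring senses by words shared with the context."""
    context = normalize(context_tokens)
    overlaps = {
        sense: len(context & normalize(gloss.split()))
        for sense, gloss in GLOSSES.items()
    }
    best = max(overlaps, key=overlaps.get)
    return best, overlaps


best, overlaps = simplified_lesk(["He", "took", "aim", "and", "fired"])
print(best, overlaps)
```

In this toy run the only overlap is the function word "and", so the winner is decided by noise rather than by meaning. Something similar may explain the `purpose.n.01` result, although that is a guess and has not been checked against NLTK's implementation.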

# My analysis of SentiWordNet
Similar to WordNet's `synset`, SentiWordNet works with a `senti_synset` for a word. Each `senti_synset` has a positive, a negative, and an objectivity score. It is important to note that these scores are assigned to a `senti_synset`, so SentiWordNet does not take into account a word's context in the sentence or in the corpus as a whole, which can make it less accurate than tools such as [VADER](https://github.com/cjhutto/vaderSentiment).

Still, I believe that SentiWordNet has many possible use cases, such as the following:
- Automatically assigning ratings to movies based on mentions in social media applications. For example, one could use Twitter's API to get all tweets that mention the movie, and then aggregate the SentiWordNet scores of those tweets to gauge whether the reviews are mostly positive or mostly negative.
- Helping a company gauge the public opinion of its brand. One could build a web crawler that searches for mentions of the company on the Web, and then aggregates positivity and negativity scores to gauge public opinion of the company. This product could even single out some of the most negative mentions and report them to the company, allowing its PR team to focus on the main areas of concern.
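Both use cases rely on aggregating per-word scores into one score per text. The sketch below shows one simple way to do that by summing positive and negative scores. The lookup table `TOY_SCORES` is hand-copied from the first `senti_synset` that the example cell in this notebook prints for each token, so it stands in for real SentiWordNet lookups; the function name and the labelling rule are assumptions made for illustration.

```python
# (pos, neg) scores copied from the first senti_synset printed for each token
# in this notebook's example output; tokens that scored 0/0 are omitted.
TOY_SCORES = {
    "is": (0.25, 0.125),
    "dying": (0.0, 0.625),
    "starvation": (0.125, 0.0),
}


def aggregate_sentiment(tokens, scores=TOY_SCORES):
    """Sum positive and negative scores; unknown tokens count as neutral."""
    pos = sum(scores.get(t, (0.0, 0.0))[0] for t in tokens)
    neg = sum(scores.get(t, (0.0, 0.0))[1] for t in tokens)
    return pos, neg


tokens = "people are still dying of starvation".split()
pos, neg = aggregate_sentiment(tokens)
label = "negative" if neg > pos else "positive" if pos > neg else "neutral"
print(pos, neg, label)
```

A real pipeline would also need to decide which sense of each word to score, since taking the first sense is only a rough default.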



In [None]:
syn_list = list(swn.senti_synsets("hunger"))
for syn in syn_list:
  print(syn)

print()

sent = "we have to solve world hunger it is unnaceptable that people are still dying of starvation"
tokens = sent.split()
for token in tokens:
  syn_list = list(swn.senti_synsets(token))
  if syn_list:
    print(token, ": ", syn_list[0])

print()

# using Lesk and POS tagging
# sent = "we have to solve world hunger it is unnaceptable that people are still dying of starvation"
# tokens = nltk.word_tokenize(sent)
# tags = nltk.pos_tag(tokens)
# for tag in tags:
#   print(tag)
#   synset = lesk(sent, tag[0], tag[1]) 
#   print(synset)

<hunger.n.01: PosScore=0.0 NegScore=0.0>
<hunger.n.02: PosScore=0.25 NegScore=0.375>
<hunger.v.01: PosScore=0.0 NegScore=0.125>
<crave.v.01: PosScore=0.5 NegScore=0.0>
<starve.v.01: PosScore=0.0 NegScore=0.25>

have :  <rich_person.n.01: PosScore=0.0 NegScore=0.0>
solve :  <solve.v.01: PosScore=0.0 NegScore=0.0>
world :  <universe.n.01: PosScore=0.0 NegScore=0.0>
hunger :  <hunger.n.01: PosScore=0.0 NegScore=0.0>
it :  <information_technology.n.01: PosScore=0.0 NegScore=0.0>
is :  <be.v.01: PosScore=0.25 NegScore=0.125>
people :  <people.n.01: PosScore=0.0 NegScore=0.0>
are :  <are.n.01: PosScore=0.0 NegScore=0.0>
still :  <still.n.01: PosScore=0.0 NegScore=0.0>
dying :  <death.n.04: PosScore=0.0 NegScore=0.625>
starvation :  <starvation.n.01: PosScore=0.125 NegScore=0.0>



# Observation on SentiWordNet Scores and their Utility in NLP Applications
I found it interesting that some of the synsets of "hunger" have a 0 `pos` and `neg` score, while others do not. This highlights the fact that, in order for SentiWordNet to be useful, you need to first identify the correct synset for every word in a sentence. 

I believe that SentiWordNet cannot be used to its full capability without combining it with something like the Lesk algorithm and POS tagging in order to identify the most probable `senti_synset`. That being said, if integrated correctly with these other tools, I believe that SentiWordNet can be very powerful in detecting the positivity or negativity of a piece of text. This positivity/negativity information has some intrinsic value and can be applied directly, but it can also be used in conjunction with other NLP tools to get more meaningful insights.
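One practical detail when combining these tools is that POS taggers such as `nltk.pos_tag` emit Penn Treebank tags (for example `NNS` or `VBG`), while WordNet lookups expect one of the letters `n`, `v`, `a`, or `r`. The helper below is a minimal sketch of that conversion; the function name `penn_to_wordnet` is made up, and the tagged tokens are written by hand rather than produced by a tagger.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or None if there is no match."""
    if tag.startswith("NN"):
        return "n"  # noun
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("RB"):
        return "r"  # adverb
    return None


# Hand-written (token, tag) pairs standing in for the output of a POS tagger
tagged = [("people", "NNS"), ("are", "VBP"), ("still", "RB"),
          ("dying", "VBG"), ("of", "IN")]
for token, tag in tagged:
    print(token, tag, penn_to_wordnet(tag))
```

The returned letter could then be passed as the `pos` argument of `lesk` or `swn.senti_synsets`, and tokens that map to `None` could simply be skipped.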

Another limitation of SentiWordNet is that it is not very nuanced. It cannot distinguish between specific emotions such as sadness, anger, relief, and excitement; it only places words on a positive/negative scale.

# What Are Collocations
Collocations are words that combine to have a meaning greater than the sum of their parts. They are often well-known phrases in a language, and they can be hard to understand for someone who is not from that culture, or for NLP applications.

In [None]:
# output collocations of text 4
print(text4.collocations())

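# calculate the mutual information for 'Old World'
# NOTE: `text` is a single string, so set(text) holds distinct characters (not
# tokens) and str.count() matches substrings such as 'the' inside 'there'.
# The p(...) values below are therefore rough scores, not true probabilities.
text = ' '.join(text4.tokens)
vocab_len = len(set(text))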

p_ol = text.count('Old World') / vocab_len
print("p(Old World) =", p_ol)

p_o = text.count('Old') / vocab_len
print("p(Old) =", p_o)

p_w = text.count('World') / vocab_len
print("p(World) = ", p_w)

pmi = math.log2(p_ol / (p_o * p_w))
print("pmi =", pmi)

print()

# calculate the mutual information for 'the citizens'
text = ' '.join(text4.tokens)
vocab_len = len(set(text))

p_ol = text.count('the citizens') / vocab_len
print("p(the citizens) =", p_ol)

p_o = text.count('the') / vocab_len
print("p(the) =", p_o)

p_w = text.count('citizens') / vocab_len
print("p(citizens) = ", p_w)

pmi = math.log2(p_ol / (p_o * p_w))
print("pmi =", pmi)

United States; fellow citizens; years ago; four years; Federal
Government; General Government; American people; Vice President; God
bless; Chief Justice; one another; fellow Americans; Old World;
Almighty God; Fellow citizens; Chief Magistrate; every citizen; Indian
tribes; public debt; foreign nations
None
p(Old World) = 0.11904761904761904
p(Old) = 0.13095238095238096
p(World) =  0.21428571428571427
pmi = 2.084888897586513

p(the citizens) = 0.13095238095238096
p(the) = 149.21428571428572
p(citizens) =  3.2142857142857144
pmi = -11.838625833423038


# Commentary on results of the Mutual Information Formula and my Interpretation
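We can see that 'Old World' has a higher mutual information than 'the citizens', indicating that 'Old World' is more likely to be a collocation. 'the citizens' is simply a bigram that happens to be common; in fact, any bigram of the form 'the NOUN' is very common in English. 'Old World', however, carries a specific meaning: it refers to Europe, Asia, and Africa as opposed to the Americas (the "New World"), a term that goes back to the age of exploration. The `p(...)` values printed above are not true probabilities (one exceeds 1) because the cell divides by the number of distinct characters rather than the number of tokens. Both bigrams share that divisor, so it shifts both PMI scores by the same constant and does not change their ranking.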

PMI formula: `log_2( P(x, y) / ( P(x) * P(y) ) )`

The PMI formula helps distinguish collocations from bigrams that are merely common by comparing how often two words occur together with how often they would occur together by chance. The numerator is the probability of the bigram, and the denominator is the product of the individual word probabilities, which is the probability of the pair if the two words were independent. A higher score therefore means that the words occur together more often than would be expected by chance. In the simple bigram 'the citizens', the word 'the' is very common, which makes the denominator larger and the score lower. Taking the log makes the scores easier to interpret: ratios above 1 (more frequent than chance) become positive, and ratios below 1 become negative.
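As a worked instance of the formula, the sketch below computes PMI from token and bigram counts on a small hand-written token list, so that every probability lies between 0 and 1. The token list and the helper name `pmi` are invented for illustration, and dividing all counts by the number of tokens is one common simplification rather than the only possible estimate.

```python
import math
from collections import Counter

# Hand-written toy corpus (17 tokens); not taken from the inaugural addresses.
tokens = ("the Old World and the citizens of the New World "
          "thank the citizens of the Old World").split()


def pmi(tokens, w1, w2):
    """log2( P(w1, w2) / (P(w1) * P(w2)) ), with all counts divided by len(tokens)."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    if bigrams[(w1, w2)] == 0:
        return float("-inf")  # the pair never occurs together
    p_xy = bigrams[(w1, w2)] / n
    p_x = unigrams[w1] / n
    p_y = unigrams[w2] / n
    return math.log2(p_xy / (p_x * p_y))


pmi_old_world = pmi(tokens, "Old", "World")
pmi_the_citizens = pmi(tokens, "the", "citizens")
print(pmi_old_world, pmi_the_citizens)
```

Both bigrams occur twice in the toy corpus, but 'the' occurs five times overall, so 'the citizens' ends up with the lower score, which mirrors the argument above.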

