# CS 4395 NLP
## WordNet Assignment
### Samuel Anozie

WordNet is a lexical database of English words covering four parts of speech: nouns, verbs, adjectives, and adverbs. Synonyms are grouped into sets called synsets, which are linked to one another by semantic relations, and nouns and verbs in particular are arranged in hierarchies. Originally created to model how humans are thought to organize the relationships between words, WordNet is a useful tool for exploring the English language with machines.

In [13]:
from nltk.corpus import wordnet as wn
word = "power"
synsets = wn.synsets(word)
print(synsets)

[Synset('power.n.01'), Synset('power.n.02'), Synset('ability.n.02'), Synset('office.n.04'), Synset('power.n.05'), Synset('exponent.n.03'), Synset('might.n.01'), Synset('world_power.n.01'), Synset('baron.n.03'), Synset('power.v.01')]


WordNet nouns are organized in a hierarchy whose root is the synset 'entity'. A noun can sit many levels of abstraction below 'entity'. In this particular example, the chosen synset has 4 hypernyms on its path to the root, counting 'entity' itself.

In [14]:
picked_synset = synsets[2]
print(picked_synset.definition())
print(picked_synset.lemmas())
print(picked_synset.examples())


possession of the qualities (especially mental qualities) required to do something or get something done
[Lemma('ability.n.02.ability'), Lemma('ability.n.02.power')]
['danger heightened his powers of discrimination']


In [15]:
hyp = picked_synset.hypernyms()[0]
top = wn.synset('entity.n.01')

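# Follow the first hypernym at each level; stop at the root or at a dead end
# so the loop cannot spin forever on a synset with no hypernyms.
while hyp:
    print(hyp)
    if hyp == top or not hyp.hypernyms():
        break
    hyp = hyp.hypernyms()[0]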

Synset('cognition.n.01')
Synset('psychological_feature.n.01')
Synset('abstraction.n.06')
Synset('entity.n.01')
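
The walk above can also be sketched without NLTK, which makes the stopping condition easier to see. This is only an illustration: the `PARENT` dictionary and the `hypernym_chain` helper are hypothetical names, and the map simply copies the chain printed above rather than querying WordNet.

```python
# Toy parent map copied from the chain printed above. A real lookup would
# call Synset.hypernyms() instead of reading this hypothetical dictionary.
PARENT = {
    "ability.n.02": "cognition.n.01",
    "cognition.n.01": "psychological_feature.n.01",
    "psychological_feature.n.01": "abstraction.n.06",
    "abstraction.n.06": "entity.n.01",
}


def hypernym_chain(name, parent=PARENT):
    """Return the ancestors of `name`, nearest first, stopping at a root."""
    chain = []
    seen = {name}
    while name in parent:      # a root has no entry, so the loop ends there
        name = parent[name]
        if name in seen:       # guard against cycles in malformed data
            break
        seen.add(name)
        chain.append(name)
    return chain


chain = hypernym_chain("ability.n.02")
print(chain)
print(len(chain))
```

Because the loop condition is "this node has a parent", a node without hypernyms ends the walk on its own instead of relying on a comparison with a known root.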


In [16]:
print(picked_synset.hypernyms())
print(picked_synset.hyponyms())
print(picked_synset.part_meronyms())
print(picked_synset.part_holonyms())
for lemma in picked_synset.lemmas():
    print(*lemma.antonyms())

[Synset('cognition.n.01')]
[Synset('aptitude.n.01'), Synset('bilingualism.n.01'), Synset('capacity.n.08'), Synset('creativity.n.01'), Synset('faculty.n.01'), Synset('hand.n.04'), Synset('intelligence.n.01'), Synset('know-how.n.01'), Synset('leadership.n.04'), Synset('originality.n.01'), Synset('skill.n.01'), Synset('skill.n.02'), Synset('superior_skill.n.01')]
[]
[]
Lemma('inability.n.01.inability')



Verbs are handled slightly differently from nouns in WordNet. There is no single common root: verbs are organized into several separate hierarchies, each with its own top-level verb, so two verbs are not guaranteed to share an ancestor.

In [17]:
word = "pick"
synsets = wn.synsets(word)
print(synsets)

[Synset('choice.n.01'), Synset('picking.n.01'), Synset('cream.n.01'), Synset('woof.n.01'), Synset('pick.n.05'), Synset('pick.n.06'), Synset('pick.n.07'), Synset('pick.n.08'), Synset('choice.n.02'), Synset('pick.v.01'), Synset('pick.v.02'), Synset('blame.v.02'), Synset('pick.v.04'), Synset('pick.v.05'), Synset('clean.v.02'), Synset('pick.v.07'), Synset('foot.v.01'), Synset('pluck.v.04'), Synset('pick.v.10'), Synset('peck.v.01'), Synset('nibble.v.03')]


In [18]:
picked_synset = synsets[17]
print(picked_synset.definition())
print(picked_synset.lemmas())
print(picked_synset.examples())

pull lightly but sharply with a plucking motion
[Lemma('pluck.v.04.pluck'), Lemma('pluck.v.04.plunk'), Lemma('pluck.v.04.pick')]
['he plucked the strings of his mandolin']


In [19]:
hyp = picked_synset.hypernyms()[0]
top = picked_synset.root_hypernyms()[0]
# Follow the first hypernym at each level; stop at the root or at a dead end.
while hyp:
    print(hyp)
    if hyp == top or not hyp.hypernyms():
        break
    hyp = hyp.hypernyms()[0]

Synset('pull.v.01')
Synset('move.v.02')


In [20]:
wn.morphy(word, wn.VERB)

'pick'

The Wu-Palmer similarity metric scores how similar two synsets are on a scale from 0 to 1, where a higher score means the two senses are more closely related in the hierarchy. The Lesk algorithm, on the other hand, resolves the ambiguity of a word by analyzing its context and choosing the sense that best matches the surrounding words. Even though it has been surpassed by more modern word sense disambiguation methods, it is a foundational approach that can inform future implementations.
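
The Wu-Palmer score is commonly written as 2 * depth(LCS) / (depth(a) + depth(b)), where LCS is the deepest ancestor the two senses share. The sketch below applies that formula to a small made-up taxonomy; `PARENT`, `path_to_root`, and `wup` are hypothetical names, the tree is not taken from WordNet, and the root is assumed to have depth 1.

```python
# Hypothetical toy taxonomy: each node maps to its single parent.
PARENT = {
    "animal": "entity",
    "plant": "entity",
    "dog": "animal",
    "cat": "animal",
}


def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def wup(a, b):
    """Wu-Palmer similarity on the toy tree, with the root at depth 1."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    ancestors_b = set(path_b)
    # The first node on a's path that also lies on b's path is the
    # deepest common ancestor (the LCS).
    lcs = next(node for node in path_a if node in ancestors_b)
    depth_lcs = len(path_to_root(lcs))
    return 2 * depth_lcs / (len(path_a) + len(path_b))


print(wup("dog", "cat"))    # siblings share the ancestor 'animal'
print(wup("dog", "plant"))  # only the root is shared
print(wup("dog", "dog"))    # identical nodes
```

Senses that only meet at the root score low, and identical senses score 1, which matches the reading of the metric given above.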

In [21]:
from nltk.wsd import lesk

look = wn.synsets("look")[4]
watch = wn.synsets("watch")[6]
print(wn.wup_similarity(look, watch))

look_sent = "I want to look at the sky"
watch_sent = "I want to watch the clouds"

print(lesk(look_sent.split(), 'look', 'n'))
print(lesk(watch_sent.split(), 'watch', 'v'))

0.5
Synset('spirit.n.02')
Synset('watch.v.05')
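
The core idea of Lesk, picking the sense whose dictionary gloss overlaps most with the context, can be sketched in a few lines. This is a simplified illustration, not NLTK's implementation: `simplified_lesk` and the `SENSES` glosses are hypothetical and written for this example only.

```python
def simplified_lesk(context_words, senses):
    """Return the sense whose gloss shares the most words with the context."""
    context = {word.lower() for word in context_words}
    best_sense, best_overlap = None, -1
    for sense, gloss in senses.items():
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


# Hypothetical glosses, not WordNet definitions.
SENSES = {
    "bank (river)": "sloping land beside a body of water such as a river",
    "bank (finance)": "an institution that accepts deposits and lends money",
}

print(simplified_lesk("she sat on the land beside the river".split(), SENSES))
print(simplified_lesk("he deposits money at the bank".split(), SENSES))
```

Since the choice rests on raw word overlap, short contexts give the algorithm little to work with. Note also that in the cell above the `'n'` argument restricts `lesk` to noun senses, which is why "look" resolves to a noun synset even though it is used as a verb in the sentence.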


SentiWordNet is one of the more interesting resources available through NLTK. Built on top of WordNet, it assigns three sentiment scores to each synset: positivity, negativity, and objectivity, which together sum to 1. For tasks that need to respond to the sentiment of a sentence rather than just its content, this resource is invaluable.

In [22]:
from nltk.corpus import sentiwordnet as swn

breakdown = swn.senti_synset('love.n.01')
print(breakdown)
print("Positive score = ", breakdown.pos_score())
print("Negative score = ", breakdown.neg_score())
print("Objective score = ", breakdown.obj_score())

sent = "I really love cake"
neg = 0
pos = 0
for token in sent.split():
    syn_list = list(swn.senti_synsets(token))
    if syn_list:
        syn = syn_list[0]
        neg += syn.neg_score()
        pos += syn.pos_score()

print("neg\tpos counts")
print(neg, '\t', pos)

<love.n.01: PosScore=0.625 NegScore=0.0>
Positive score =  0.625
Negative score =  0.0
Objective score =  0.375
neg	pos counts
0.0 	 1.25
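
The loop above scores each token by its first sense only. One possible alternative, sketched here under stated assumptions, is to average the scores over all senses of a token before summing. The `SCORES` table and the `sentence_polarity` helper are hypothetical, and the numbers are made up for illustration rather than read from SentiWordNet.

```python
# Hypothetical (positive, negative) score pairs, one pair per sense.
SCORES = {
    "love": [(0.625, 0.0), (0.5, 0.0)],
    "hate": [(0.0, 0.75)],
    "cake": [(0.0, 0.0)],
}


def sentence_polarity(tokens, scores=SCORES):
    """Sum per-token scores, averaging over every listed sense of a token."""
    pos = neg = 0.0
    for token in tokens:
        senses = scores.get(token.lower())
        if not senses:  # tokens without an entry contribute nothing
            continue
        pos += sum(p for p, _ in senses) / len(senses)
        neg += sum(n for _, n in senses) / len(senses)
    return pos, neg


print(sentence_polarity("I really love cake".split()))
print(sentence_polarity("I hate cake".split()))
```

Averaging dilutes a strongly polar first sense, so which variant is preferable depends on how reliable the sense ordering is for the text at hand.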


Words that appear next to each other should sometimes be treated as a pair rather than independently. In the example below, phrases like "United States", "one another", and "Indian tribes" only carry their intended meaning when the words are kept together, and would mean something different if they were separated. For these cases, collocations are identified by comparing how often two words occur next to each other with how often each occurs on its own.

In [23]:
import math
from nltk.book import text4

text4.collocations()
pre = "foreign"
post = "nations"
pre_count = 0
post_count = 0
both_count = 0
pre_hit = False

for token in text4.tokens:
    if token == pre:
        pre_count += 1
        pre_hit = True
    elif token == post and pre_hit:
        post_count += 1
        both_count += 1
        pre_hit = False
    elif token == post:
        post_count += 1
        pre_hit = False

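# Estimate probabilities from raw counts, then take the log of
# P(pre, post) / (P(pre) * P(post)).
n = len(text4.tokens)
p_both = both_count / n
p_pre = pre_count / n
p_post = post_count / n
mi = math.log(p_both / (p_pre * p_post))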
print(mi)


United States; fellow citizens; years ago; four years; Federal
Government; General Government; American people; Vice President; God
bless; Chief Justice; one another; fellow Americans; Old World;
Almighty God; Fellow citizens; Chief Magistrate; every citizen; Indian
tribes; public debt; foreign nations
6.197700309293973
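
The calculation above is a form of pointwise mutual information (PMI). In that loop, `pre_hit` is only cleared when the second word is seen, so a pair can be counted even when other tokens sit between the two words. The sketch below is a stricter variant that counts only directly adjacent pairs and uses a base-2 logarithm, which is a common convention. The `pmi` helper and the token list are hypothetical, and dividing the pair count by the number of tokens rather than the number of bigrams is a simplifying assumption.

```python
import math


def pmi(tokens, first, second):
    """Pointwise mutual information of `first` immediately followed by `second`."""
    n = len(tokens)
    first_count = tokens.count(first)
    second_count = tokens.count(second)
    # Count only positions where the two words are directly adjacent.
    pair_count = sum(
        1 for a, b in zip(tokens, tokens[1:]) if a == first and b == second
    )
    if pair_count == 0:
        return float("-inf")  # the pair never occurs together
    p_pair = pair_count / n
    return math.log2(p_pair / ((first_count / n) * (second_count / n)))


# Made-up sentence used only to exercise the function.
tokens = "we trade with foreign nations and foreign nations trade with us".split()
print(pmi(tokens, "foreign", "nations"))
print(pmi(tokens, "nations", "foreign"))
```

A positive score means the pair occurs together more often than independent occurrence would predict, which is the signal used to flag a collocation.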
