## **Portfolio 3: Exploring WordNet**

###### *Author: Shreya Valaboju*
###### *Section: CS 4395.001*

###### *Execute the notebook from top to bottom. For more info, refer to readme_portfolio3.txt* 



WordNet is a lexical database widely used in computational linguistics and natural language processing. Nouns, verbs, adjectives, and adverbs are grouped into sets of synonyms, known as "synsets." These synsets are organized hierarchically through relations such as hypernyms, hyponyms, holonyms, and meronyms. This notebook explores the basic functionality of WordNet using NLTK.


In [1]:
# import/download necessary libraries
import nltk
import math

nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('sentiwordnet')
nltk.download('gutenberg')
nltk.download('genesis')
nltk.download('inaugural')
nltk.download('nps_chat')
nltk.download('webtext')
nltk.download('treebank')
nltk.download('stopwords')

from nltk.book import text4
from nltk.corpus import wordnet as wn
from nltk.wsd import lesk
from nltk.corpus import sentiwordnet as swn

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/shreyavalaboju/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/shreyavalaboju/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package sentiwordnet to
[nltk_data]     /Users/shreyavalaboju/nltk_data...
[nltk_data]   Unzipping corpora/sentiwordnet.zip.
[nltk_data] Downloading package gutenberg to
[nltk_data]     /Users/shreyavalaboju/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.
[nltk_data] Downloading package genesis to
[nltk_data]     /Users/shreyavalaboju/nltk_data...
[nltk_data]   Unzipping corpora/genesis.zip.
[nltk_data] Downloading package inaugural to
[nltk_data]     /Users/shreyavalaboju/nltk_data...
[nltk_data]   Unzipping corpora/inaugural.zip.
[nltk_data] Downloading package nps_chat to
[nltk_data]     /Users/shreyavalaboju/nltk_data...
[nltk_data]   Unzipping corpora/nps_c

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


#### 1. Synsets for Nouns and Verbs
Let's explore how WordNet is organized for nouns and verbs

##### Nouns: 

In [2]:
# all synsets for a noun, 'elephant'
wn.synsets('elephant')

[Synset('elephant.n.01'), Synset('elephant.n.02')]

In [3]:
# choose 1 synset out of all for the noun
elephant_synset = wn.synset('elephant.n.01')

In [4]:
# extract definition, usage, lemmas if possible
print('Definition: ', elephant_synset.definition())
print('Usage: ', elephant_synset.examples())
print('Lemmas: ', elephant_synset.lemmas())

Definition:  five-toed pachyderm
Usage:  []
Lemmas:  [Lemma('elephant.n.01.elephant')]


In [5]:
# traverse the hierarchy of the synset for the noun (naive approach: follow only the first hypernym)
hyp = elephant_synset.hypernyms()[0] # a hypernym is a broader synset that the noun falls under
top = wn.synset('entity.n.01') # stop once the top-level synset is reached

while hyp: # keep climbing to the next hypernym (the synset above)
  print(hyp)
  if hyp == top or not hyp.hypernyms(): # stop at the root, or when no hypernym exists (avoids an infinite loop)
    break
  hyp = hyp.hypernyms()[0]



Synset('pachyderm.n.01')
Synset('placental.n.01')
Synset('mammal.n.01')
Synset('vertebrate.n.01')
Synset('chordate.n.01')
Synset('animal.n.01')
Synset('organism.n.01')
Synset('living_thing.n.01')
Synset('whole.n.02')
Synset('object.n.01')
Synset('physical_entity.n.01')
Synset('entity.n.01')


In [6]:
# print hypernyms, hyponyms, meronyms, holonyms, antonyms

print('Hypernyms: ', elephant_synset.hypernyms())
print('Hyponyms: ', elephant_synset.hyponyms())


# not all nouns have holonyms, meronyms, or antonyms; these calls return an empty list when none exist
# (only member holonyms and part meronyms are checked here)
print('Holonyms: ', elephant_synset.member_holonyms())
print('Meronyms: ', elephant_synset.part_meronyms())


ant=[] # holds all antonyms found
# iterate through the synset's lemmas to find any antonyms
for lemma in elephant_synset.lemmas():
  if lemma.antonyms():
    ant.append(lemma.antonyms()[0].name())
print('Antonyms: ',ant)


Hypernyms:  [Synset('pachyderm.n.01'), Synset('proboscidean.n.01')]
Hyponyms:  [Synset('african_elephant.n.01'), Synset('gomphothere.n.01'), Synset('indian_elephant.n.01'), Synset('mammoth.n.01'), Synset('rogue_elephant.n.01')]
Holonyms:  [Synset('elephantidae.n.01')]
Meronyms:  [Synset('proboscis.n.02'), Synset('tusk.n.02')]
Antonyms:  []


WordNet organizes nouns through synsets. In this example, the noun 'elephant' has a chain of synsets above it in the hierarchy, ending at 'entity.n.01'. Note that 'elephant.n.01' has two hypernyms, and the naive traversal follows only the first. Noun synsets are connected through hypernyms (more general), hyponyms (more specific), meronyms (parts), and holonyms (wholes); troponyms (more specific manners of an action) apply to verbs rather than nouns. Nouns are the most highly connected synsets. Not every noun has every type of relation; for instance, a noun may have no meronyms.
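The hypernym walk can also be sketched without WordNet at all. The block below is a minimal illustration, assuming a toy dictionary (`FIRST_HYPERNYM`, a hypothetical name) filled in from the chain printed earlier; it is a stand-in for the database, not a replacement for `hypernyms()`. The `seen` set and the `None` check guard against loops and missing parents.

```python
# Toy map of synset name -> first hypernym, copied from the chain printed earlier.
# This is an illustrative stand-in for WordNet, not the real database.
FIRST_HYPERNYM = {
    "elephant.n.01": "pachyderm.n.01",
    "pachyderm.n.01": "placental.n.01",
    "placental.n.01": "mammal.n.01",
    "mammal.n.01": "vertebrate.n.01",
    "vertebrate.n.01": "chordate.n.01",
    "chordate.n.01": "animal.n.01",
    "animal.n.01": "organism.n.01",
    "organism.n.01": "living_thing.n.01",
    "living_thing.n.01": "whole.n.02",
    "whole.n.02": "object.n.01",
    "object.n.01": "physical_entity.n.01",
    "physical_entity.n.01": "entity.n.01",
}


def hypernym_chain(name, parents):
    """Follow first hypernyms upward until a node has no parent."""
    chain = []
    seen = {name}
    current = parents.get(name)
    while current is not None and current not in seen:
        chain.append(current)
        seen.add(current)
        current = parents.get(current)
    return chain


chain = hypernym_chain("elephant.n.01", FIRST_HYPERNYM)
print(len(chain), chain[0], chain[-1])  # 12 pachyderm.n.01 entity.n.01
```

Because the loop ends when no parent is found, it also terminates for a synset that never reaches 'entity.n.01'.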

##### Verbs: 

In [7]:
# explore synsets for a verb
wn.synsets('snoring')

[Synset('snore.n.02'), Synset('snore.v.01')]

In [8]:
# pick a synset for the verb, 'snore'
snore_synset = wn.synset('snore.v.01')

In [9]:
# extract definition, usage, lemmas if possible
print('Definition: ', snore_synset.definition())
print('Usage: ', snore_synset.examples())
print('Lemmas: ', snore_synset.lemmas())

Definition:  breathe noisily during one's sleep
Usage:  ['she complained that her husband snores']
Lemmas:  [Lemma('snore.v.01.snore'), Lemma('snore.v.01.saw_wood'), Lemma('snore.v.01.saw_logs')]


In [10]:
# traverse hierarchy (more sophisticated method: closure follows hypernyms transitively)
hyper = lambda s: s.hypernyms()
list(snore_synset.closure(hyper))

[Synset('breathe.v.01')]

In [11]:
# use morphy to find the base form of the word for a given part of speech
print(wn.morphy('snoring', wn.VERB))
print(wn.morphy('snoring', wn.NOUN))

snore
snoring


WordNet organizes verbs much as it does nouns, through synsets and a hierarchy. As the 'snore' example shows, verbs take part in hypernym/hyponym relations; here, 'breathe' is a hypernym of 'snore'. The chain is short in this case: the closure returned a single synset. Part of speech also matters with verbs. The word 'snoring' matches both a noun synset and a verb synset, and morphy returns a different base form for each. So, when analyzing a verb, it is important to pick the synset whose part of speech is verb.
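Picking the verb synset can be made mechanical, since the part of speech is encoded in the synset name. The sketch below assumes names of the form `lemma.pos.nn`, as in the output of `wn.synsets('snoring')`; `synset_pos` is a hypothetical helper, not an NLTK function.

```python
def synset_pos(name):
    """Return the part-of-speech tag from a synset name such as 'snore.v.01'."""
    _lemma, pos, _sense = name.rsplit(".", 2)
    return pos


names = ["snore.n.02", "snore.v.01"]  # names from the wn.synsets('snoring') output
verb_names = [n for n in names if synset_pos(n) == "v"]
print(verb_names)  # ['snore.v.01']
```

With NLTK itself, passing the part of speech directly, as in `wn.synsets('snoring', pos=wn.VERB)`, should give the same filtering without parsing names.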

#### 2. Similarity between 2 Words
Using various metrics and algorithms to calculate how similar words are

In [12]:
# pick 2 similar words, select synsets for each
person = wn.synset('person.n.01')
human = wn.synset('homo.n.02')

In [13]:
# Calculate Wu-Palmer Similarity metric
wn.wup_similarity(person, human)

0.5714285714285714

In [14]:
# Run the Lesk Algorithm on 'person'
sent_person = ['That', 'person', 'is','my','friend','.']
print(lesk(sent_person, 'person', 'n'))


Synset('person.n.03')


In [15]:
# Run Lesk Algorithm on 'human'
sent_human = ['The','species','is','human','.'] # here is an example sentence where human is used in it.
print(lesk(sent_human, 'human', 'n'))

Synset('homo.n.02')
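The Lesk calls above choose a sense by comparing the sentence with each sense's gloss. The block below is a simplified sketch of that overlap idea, not NLTK's implementation. The sense labels and glosses are made up for this example and are not WordNet's text.

```python
def simplified_lesk(context_tokens, glosses):
    """Return the sense whose gloss shares the most words with the context."""
    context = {t.lower() for t in context_tokens}
    best_sense, best_overlap = None, -1
    for sense, gloss in glosses.items():
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


# Hypothetical senses and glosses written for this example only.
toy_glosses = {
    "bank_river": "sloping land beside a river or lake",
    "bank_money": "an institution that accepts deposits and lends money",
}
sentence = ["She", "sat", "on", "the", "bank", "of", "the", "river"]
print(simplified_lesk(sentence, toy_glosses))  # bank_river
```

With very short sentences the overlap is often zero for every sense, so the choice comes down to tie-breaking. That is one reason a short context can yield an unexpected sense.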


The Wu-Palmer similarity metric scores two synsets using their depths in the WordNet hierarchy and the depth of their most specific common ancestor. The Lesk algorithm, on the other hand, looks at context: it compares the words in the sentence with each sense's dictionary gloss and picks the sense with the most overlap. From the cells above, 'person' and 'human' are moderately similar, with a Wu-Palmer score of 0.57. We would expect the similarity to be higher, so the two synsets probably sit on somewhat different branches of the hierarchy. For 'human', Lesk returned 'homo.n.02', the same synset selected earlier. For 'person', it returned 'person.n.03' rather than the 'person.n.01' synset selected earlier, so the short context sentence was not enough to pick out the expected sense. Both results are nouns because the part of speech was passed as 'n'.
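To show how the depths feed into the Wu-Palmer score, here is a minimal sketch on a toy taxonomy. The tree in `PARENT` is hypothetical and is not WordNet's real structure. It was chosen so that the depths give 4/7, the same value printed above; the actual WordNet paths for these synsets may differ.

```python
# Toy taxonomy (child -> parent); hypothetical, not WordNet's real structure.
PARENT = {
    "organism": "entity",
    "person": "organism",
    "animal": "organism",
    "hominid": "animal",
}


def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))


def wu_palmer(a, b):
    """2 * depth(LCS) / (depth(a) + depth(b)), where LCS is the deepest shared ancestor."""
    ancestors_a = path_to_root(a)
    lcs = next(n for n in path_to_root(b) if n in ancestors_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))


print(wu_palmer("person", "hominid"))  # 2 * 2 / (3 + 4) = 0.5714285714285714
```

The score reaches 1.0 only when both nodes are the same, and it drops as the shared ancestor moves closer to the root.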

#### 3. Senti-WordNet

Senti-WordNet is built on top of WordNet. It is used to analyze sentiment: whether a text is positive or negative, and whether it is objective or subjective. Senti-WordNet assigns each synset a positive, a negative, and an objective score. Sentiment analysis has many use cases. For example, Senti-WordNet, as an NLP tool, can be applied to a body of text to analyze and improve customer service, social media, or market research. The few cells below demonstrate how Senti-WordNet is used.

In [16]:
# choose an 'emotionally charged' word
wn.synsets('rage')

[Synset('fury.n.01'),
 Synset('rage.n.02'),
 Synset('rage.n.03'),
 Synset('rage.n.04'),
 Synset('fad.n.01'),
 Synset('ramp.v.01'),
 Synset('rage.v.02'),
 Synset('rage.v.03')]

In [17]:
rage_synset = wn.synset('rage.n.02') # one noun sense of 'rage'

In [18]:
# get senti-synsets for the emotionally charged word (noun senses only)
senti_rage = swn.senti_synsets('rage','n')
for item in senti_rage:
    print(item)

<fury.n.01: PosScore=0.25 NegScore=0.5>
<rage.n.02: PosScore=0.0 NegScore=0.125>
<rage.n.03: PosScore=0.625 NegScore=0.0>
<rage.n.04: PosScore=0.0 NegScore=0.125>
<fad.n.01: PosScore=0.25 NegScore=0.0>


In [19]:
# output polarity scores for each senti-synset

for s in swn.senti_synsets('rage','n'):
  print(s)
  print("negative: ", s.neg_score())
  print("positive: ", s.pos_score())
  print("objective: ", s.obj_score())
  print("\n")


<fury.n.01: PosScore=0.25 NegScore=0.5>
negative:  0.5
positive:  0.25
objective:  0.25


<rage.n.02: PosScore=0.0 NegScore=0.125>
negative:  0.125
positive:  0.0
objective:  0.875


<rage.n.03: PosScore=0.625 NegScore=0.0>
negative:  0.0
positive:  0.625
objective:  0.375


<rage.n.04: PosScore=0.0 NegScore=0.125>
negative:  0.125
positive:  0.0
objective:  0.875


<fad.n.01: PosScore=0.25 NegScore=0.0>
negative:  0.0
positive:  0.25
objective:  0.75




In [20]:
# Make up a sentence and output the polarity of each word (stop words are not removed here)
sentence = 'women expressed intense rage '
tokens = sentence.split() # split the sentence into tokens

print("Polarity for each word in the sentence: '",sentence, "'\n")

# iterate through each token and print its polarity
for t in tokens:
  senti_list = list(swn.senti_synsets(t))
  if not senti_list: # skip words that have no senti-synsets (avoids an IndexError)
    continue
  word_syn = senti_list[0] # pick the first senti-synset
  polarity = word_syn.pos_score() - word_syn.neg_score() # polarity = positive score minus negative score
  print(t,": ", polarity)


Polarity for each word in the sentence: ' women expressed intense rage  '

women :  0.0
expressed :  0.0
intense :  -0.125
rage :  -0.25


Here are some observations about our emotionally charged word, 'rage.' In our sentence, the words 'women' and 'expressed' have neutral polarity, while 'intense' and 'rage' are negative. These results were mostly expected, although I predicted that 'rage' would score lower than -0.25. The score depends on which senti-synset is selected: the code takes the first one, 'fury.n.01', and the other senses of 'rage' carry different positive and negative scores. Such scores are important in NLP because they let us quantify sentiment. The sign of the polarity gives the direction of the sentiment for a word, and its magnitude gives the strength. This is useful for determining the overall sentiment of a text and for finding the words or passages that carry the most intense emotion.
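The dependence on the selected sense can be checked with plain arithmetic. The sketch below copies the positive and negative scores from the senti-synset output for the noun 'rage' and compares the first-sense polarity with the mean over all five noun senses. Averaging is only one possible alternative, shown here as an assumption rather than a recommendation. The objective score is reproduced as 1 - (positive + negative), which matches every value printed above.

```python
# (pos, neg) scores copied from the senti-synset output for the noun 'rage'.
RAGE_NOUN_SCORES = {
    "fury.n.01": (0.25, 0.5),
    "rage.n.02": (0.0, 0.125),
    "rage.n.03": (0.625, 0.0),
    "rage.n.04": (0.0, 0.125),
    "fad.n.01": (0.25, 0.0),
}


def objective(pos, neg):
    """Objective score implied by the positive and negative scores."""
    return 1.0 - (pos + neg)


def polarity(pos, neg):
    """Positive score minus negative score."""
    return pos - neg


first_sense = polarity(*RAGE_NOUN_SCORES["fury.n.01"])
mean_polarity = sum(polarity(p, n) for p, n in RAGE_NOUN_SCORES.values()) / len(RAGE_NOUN_SCORES)
print(first_sense, mean_polarity)  # -0.25 0.075
```

The first sense gives -0.25, while the mean over the noun senses is slightly positive, because 'rage.n.03' and 'fad.n.01' pull the average up.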

#### 4. Text Collocations

A collocation is a combination of two or more words that occur together more often than chance would predict, and whose intended meaning is lost if one of the words is substituted. For example, the collocation 'strong tea' does not mean the tea is muscular or can lift heavy objects. Collocations can be found using pointwise mutual information (PMI). If PMI = 0, the two words are independent. If PMI > 0, the words co-occur more often than expected and are likely a collocation; if PMI < 0, they co-occur less often than expected. Let's take a closer look at collocations in one of the NLTK texts, the Inaugural Address Corpus (text4).
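The three cases follow from the definition of PMI, written here with base-2 logarithms:

$$
\mathrm{PMI}(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}
$$

If the two words are independent, then $P(x, y) = P(x)\,P(y)$, the ratio is 1, and the logarithm is 0. A ratio above 1 gives a positive PMI, and a ratio below 1 gives a negative PMI.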

In [21]:
# collocations for text4
text4.collocations()

United States; fellow citizens; years ago; four years; Federal
Government; General Government; American people; Vice President; God
bless; Chief Justice; one another; fellow Americans; Old World;
Almighty God; Fellow citizens; Chief Magistrate; every citizen; Indian
tribes; public debt; foreign nations


In [22]:
# join text into 1 string and lowercase for convenience
text = ' '.join(text4.tokens).lower() 
text



In [23]:
# choose 1 collocation and calculate its pointwise mutual information (PMI)
collocation = 'chief justice'
collocation_first='chief'
collocation_last='justice'

# PMI formula: log2( P(x,y) / [P(x) * P(y)] )
# note: text.count() counts substring matches (e.g. 'chief' also matches inside 'mischief'),
# and the counts are divided by the vocabulary size rather than the total number of tokens,
# so these probabilities are rough estimates

vocab_size = len(set(text4))
prob_cj = text.count(collocation)/vocab_size
print("P('chief justice'): ",prob_cj)

prob_c = text.count(collocation_first)/vocab_size
print("P('chief'): ",prob_c)

prob_j = text.count(collocation_last)/vocab_size
print("P('justice'): ",prob_j)

pmi = math.log2(prob_cj / (prob_c * prob_j)) # log2 of P(x,y) divided by the product P(x) * P(y)
print('PMI = ', pmi)

P('chief justice'):  0.001396508728179551
P('chief'):  0.00458852867830424
P('justice'):  0.015461346633416459
PMI =  4.298983176955998


From our results, we can observe that 'chief justice' is likely a collocation, as its PMI score is well above 0. Each of the two words also has a low probability of occurrence in the text, with values close to 0. We can infer that words with low individual probabilities that still appear together are likely to form collocations. By contrast, let's say we wanted to find the mutual information of 'the people.' The word 'the' would appear many times, and so would 'people.' A reasonable hypothesis is that the PMI would be lower, because both words are frequent on their own and are also used literally in many other contexts.
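The hypothesis about 'the people' can be tried on a small scale. The sketch below estimates PMI from token and bigram counts rather than substring counts. The token list is made up for illustration and is not drawn from text4, and `bigram_pmi` is a hypothetical helper name. It shows one common way to estimate the probabilities, not the method used in the cell above.

```python
import math
from collections import Counter


def bigram_pmi(tokens, first, second):
    """PMI of an adjacent word pair, estimated from token and bigram counts."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = n_tokens - 1
    p_xy = bigrams[(first, second)] / n_bigrams
    p_x = unigrams[first] / n_tokens
    p_y = unigrams[second] / n_tokens
    if p_xy == 0 or p_x == 0 or p_y == 0:
        return float("-inf")  # the pair (or one of its words) never occurs
    return math.log2(p_xy / (p_x * p_y))


# Made-up token list for illustration only; it is not drawn from text4.
toy_tokens = ("the chief justice spoke to the people and the people heard "
              "the chief justice speak of justice and mischief").split()

print(bigram_pmi(toy_tokens, "chief", "justice"))
print(bigram_pmi(toy_tokens, "the", "people"))

# Token counts ignore 'mischief'; substring counts do not.
print(Counter(toy_tokens)["chief"], " ".join(toy_tokens).count("chief"))  # 2 3
```

On this toy text, 'chief justice' scores higher than 'the people', which is consistent with the hypothesis; a real check would run the same counts over `text4.tokens`. NLTK also provides `BigramCollocationFinder` and `BigramAssocMeasures` in `nltk.collocations` for this kind of scoring.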