# **WordNet**

WordNet is a large database of English words organized into various relational groupings. Words are grouped into sets of cognitive synonyms called synsets. Additionally, they are linked, often hierarchically, by semantic relations such as antonyms, hyponyms, hypernyms, meronyms, and holonyms.

In [337]:
# Allows WordNet and SentiWordNet functions to be run
from nltk.corpus import wordnet as wn, sentiwordnet as swn
# For Lesk Algorithm demonstration
from nltk.wsd import lesk
# For Collocations
import math
from nltk.book import text4

## *WordNet Applied to a Noun*

## Viewing All Synsets

`synsets(word, pos)`

`word`: String value - Word to return synsets for

`pos`: *(Optional)* String - Part of speech constraint

*returns*: List of Synsets

In [338]:
# Noun in a variable to easily run the code for other nouns
noun = 'dog'
# Part of speech specified to be a noun
nounSynsets = wn.synsets(noun, pos=wn.NOUN)
print(nounSynsets)

[Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01')]


## Extracting a Synset's Information

`name()`: *returns* String

`definition()`: *returns* String

`examples()`: *returns* List of Strings

`lemmas()`: *returns* List of Lemmas

In [339]:
# Synset being examined
nounSyn = nounSynsets[0]

# References the list of synsets so this code will run with different choices of noun
nounName = nounSyn.name()
nounDef = nounSyn.definition()
nounUsage = nounSyn.examples()
nounLemmas = nounSyn.lemmas()

print('Definition of', nounName, ':', nounDef, '\n')
print('Usage of', nounName, ':', nounUsage, '\n')
print('Lemmas of', nounName, ':', nounLemmas)

Definition of dog.n.01 : a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds 

Usage of dog.n.01 : ['the dog barked all night'] 

Lemmas of dog.n.01 : [Lemma('dog.n.01.dog'), Lemma('dog.n.01.domestic_dog'), Lemma('dog.n.01.Canis_familiaris')]


## WordNet Hierarchy

WordNet organizes synsets into a hierarchy of relations, which can be traversed by following those relations. Nouns are organized in a particularly structured way: every noun's hypernym hierarchy can be traced up to the same root, `entity.n.01`.

#### Terminology

**Hypernym**: A superordinate (has a broader meaning) of a word - *color* is a hypernym of *blue*

**Hyponym**: A subordinate (has a more specific meaning) of a word - *hammer* is a hyponym of *tool*

**Meronym**: A part of something - *toe* is a meronym of *foot*

**Holonym**: The whole that something is a part of - *couch* is a holonym of *cushion*
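The four terms above come in two inverse pairs: hypernym/hyponym and holonym/meronym. As a minimal sketch of that idea, the hand-written maps below (toy data using the example words from the definitions, not read from WordNet) are inverted to go from the broader word or the whole back to the specific words or parts. The names `HYPERNYM_OF`, `HOLONYM_OF`, and `invert` are made up for this illustration.

```python
# Toy data: hand-written for illustration, not read from WordNet.
HYPERNYM_OF = {"blue": "color", "red": "color", "hammer": "tool", "saw": "tool"}
HOLONYM_OF = {"toe": "foot", "cushion": "couch"}


def invert(relation):
    """Turn a child -> parent map into a parent -> sorted list of children map."""
    inverted = {}
    for child, parent in relation.items():
        inverted.setdefault(parent, []).append(child)
    return {parent: sorted(children) for parent, children in inverted.items()}


HYPONYMS_OF = invert(HYPERNYM_OF)  # broader word -> more specific words
MERONYMS_OF = invert(HOLONYM_OF)   # whole -> its parts

print("Hyponyms of color:", HYPONYMS_OF["color"])
print("Meronyms of foot:", MERONYMS_OF["foot"])
```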

### Traversing the Hierarchy

Here the common `entity.n.01` synset can be seen for two nouns. Even though `dog.n.01` and `frank.n.02` have fairly different meanings, both are nouns, so both hierarchies end at `entity.n.01`.

In [340]:
# A function that traverses up a synset's hierarchy
hyperFunc = lambda syn: syn.hypernyms()
# closure() repeatedly applies a function to a synset and to the synsets it returns
# Used here to follow the hypernym relations all the way up the hierarchy
nounHierarchy = list(nounSyn.closure(hyperFunc))

# Another noun can be chosen to demonstrate the shared entity
altNounSyn = nounSynsets[4]
altNounName = altNounSyn.name()
altNounHierarchy = list(altNounSyn.closure(hyperFunc))

print(nounName, ':', nounHierarchy)
print()
print(altNounName, ':', altNounHierarchy)

dog.n.01 : [Synset('canine.n.02'), Synset('domestic_animal.n.01'), Synset('carnivore.n.01'), Synset('animal.n.01'), Synset('placental.n.01'), Synset('organism.n.01'), Synset('mammal.n.01'), Synset('living_thing.n.01'), Synset('vertebrate.n.01'), Synset('whole.n.02'), Synset('chordate.n.01'), Synset('object.n.01'), Synset('physical_entity.n.01'), Synset('entity.n.01')]

frank.n.02 : [Synset('sausage.n.01'), Synset('meat.n.01'), Synset('food.n.02'), Synset('solid.n.01'), Synset('matter.n.03'), Synset('physical_entity.n.01'), Synset('entity.n.01')]
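To make the idea of following hypernyms until nothing new is found more concrete, here is a small sketch of a breadth-first closure over a hand-written hypernym map. The map `TOY_HYPERNYMS` and the function `toy_closure` are hypothetical stand-ins loosely modeled on the output above; this is not NLTK's own implementation.

```python
from collections import deque

# Toy hypernym map (hypothetical, loosely modeled on the output above).
TOY_HYPERNYMS = {
    "dog": ["canine", "domestic_animal"],
    "canine": ["carnivore"],
    "domestic_animal": ["animal"],
    "carnivore": ["animal"],
    "animal": ["entity"],
    "entity": [],
}


def toy_closure(start, relation):
    """Breadth-first walk that yields every node reachable through `relation`."""
    seen = set()
    queue = deque(relation(start))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # a node reached by two paths is reported only once
        seen.add(node)
        yield node
        queue.extend(relation(node))


toyHyperFunc = lambda word: TOY_HYPERNYMS.get(word, [])
toyHierarchy = list(toy_closure("dog", toyHyperFunc))
print("dog :", toyHierarchy)
```

Because `animal` is reachable through both `carnivore` and `domestic_animal`, the `seen` set is what keeps it from appearing twice in the result.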


### Relations Between Synsets

In [341]:
nounHypernyms = nounSynsets[0].hypernyms()
nounHyponyms = nounSynsets[0].hyponyms()
nounMeronyms = nounSynsets[0].part_meronyms()
nounHolonyms = nounSynsets[0].part_holonyms()
# Antonyms are defined only over lemmas
nounAntonyms = nounSynsets[0].lemmas()[0].antonyms()

print('Hypernyms of', nounName, ':', nounHypernyms, '\n')
print('Hyponyms of', nounName, ':', nounHyponyms, '\n')
print('Meronyms of', nounName, ':', nounMeronyms, '\n')
print('Holonyms of', nounName, ':', nounHolonyms, '\n')
print('Antonyms of', nounName, ':', nounAntonyms, '\n')

Hypernyms of dog.n.01 : [Synset('canine.n.02'), Synset('domestic_animal.n.01')] 

Hyponyms of dog.n.01 : [Synset('basenji.n.01'), Synset('corgi.n.01'), Synset('cur.n.01'), Synset('dalmatian.n.02'), Synset('great_pyrenees.n.01'), Synset('griffon.n.02'), Synset('hunting_dog.n.01'), Synset('lapdog.n.01'), Synset('leonberg.n.01'), Synset('mexican_hairless.n.01'), Synset('newfoundland.n.01'), Synset('pooch.n.01'), Synset('poodle.n.01'), Synset('pug.n.01'), Synset('puppy.n.01'), Synset('spitz.n.01'), Synset('toy_dog.n.01'), Synset('working_dog.n.01')] 

Meronyms of dog.n.01 : [Synset('flag.n.07')] 

Holonyms of dog.n.01 : [] 

Antonyms of dog.n.01 : [] 



## *WordNet Applied to a Verb*

## Viewing All Synsets

Here the part of speech will be limited to verbs only.

In [342]:
verb = 'run'
verbSynsets = wn.synsets(verb, pos=wn.VERB)
print(verbSynsets)

[Synset('run.v.01'), Synset('scat.v.01'), Synset('run.v.03'), Synset('operate.v.01'), Synset('run.v.05'), Synset('run.v.06'), Synset('function.v.01'), Synset('range.v.01'), Synset('campaign.v.01'), Synset('play.v.18'), Synset('run.v.11'), Synset('tend.v.01'), Synset('run.v.13'), Synset('run.v.14'), Synset('run.v.15'), Synset('run.v.16'), Synset('prevail.v.03'), Synset('run.v.18'), Synset('run.v.19'), Synset('carry.v.15'), Synset('run.v.21'), Synset('guide.v.05'), Synset('run.v.23'), Synset('run.v.24'), Synset('run.v.25'), Synset('run.v.26'), Synset('run.v.27'), Synset('run.v.28'), Synset('run.v.29'), Synset('run.v.30'), Synset('run.v.31'), Synset('run.v.32'), Synset('run.v.33'), Synset('run.v.34'), Synset('ply.v.03'), Synset('hunt.v.01'), Synset('race.v.02'), Synset('move.v.13'), Synset('melt.v.01'), Synset('ladder.v.01'), Synset('run.v.41')]


## Extracting a Synset's Information

In [343]:
verbSyn = verbSynsets[3]

verbName = verbSyn.name()
verbDef = verbSyn.definition()
verbUsage = verbSyn.examples()
verbLemmas = verbSyn.lemmas()

print('Definition of', verbName, ':', verbDef, '\n')
print('Usage of', verbName, ':', verbUsage, '\n')
print('Lemmas of', verbName, ':', verbLemmas)

Definition of operate.v.01 : direct or control; projects, businesses, etc. 

Usage of operate.v.01 : ['She is running a relief operation in the Sudan'] 

Lemmas of operate.v.01 : [Lemma('operate.v.01.operate'), Lemma('operate.v.01.run')]


## WordNet Hierarchy

Verbs have no counterpart to `entity.n.01`, so their hierarchy trees can vary greatly: verbs do not necessarily share a single tree, and different verbs can have different roots. In this case, `operate.v.01` has a root of `control.v.01` while `run.v.01` has a root of `travel.v.01`.

In [344]:
# Uses the hyperFunc lambda function defined earlier
verbHierarchy = list(verbSyn.closure(hyperFunc))

# To examine different hierarchies
altVerbSyn = verbSynsets[0]
altVerbName = altVerbSyn.name()
altVerbHierarchy = list(altVerbSyn.closure(hyperFunc))

print(verbName, ':', verbHierarchy)
print()
print(altVerbName, ':', altVerbHierarchy)

operate.v.01 : [Synset('direct.v.04'), Synset('manage.v.02'), Synset('control.v.01')]

run.v.01 : [Synset('travel_rapidly.v.01'), Synset('travel.v.01')]


## Morphy

Morphy allows us to do searches on inflected forms of words. This is an important function to have because WordNet usually only stores the base forms of words. Morphy applies a set of morphological rules while also checking for exceptions to handle many different possibilities of inflectional forms. This can be useful for words that can have a wide variety of inflected forms. 

`morphy(word, pos)`

`word`: String - the inflected form of a word

`pos`: *(Optional)* String - Part of speech constraint

*returns*: String - The root form of the inflected word (or `None` if no root form is found)

Here we can see morphy in action using different inflected forms of the word **"dog"**. The inflected forms have different synsets associated with them. Morphy's morphological analysis, however, shows that these forms, despite their different meanings and seemingly different origins, are all derived from the same root word.

In [345]:
wordList = ['dogs', 'dogged', 'dogging']

# Extracts and prints root word for the different inflected forms
for word in wordList:
    rootWord = wn.morphy(word)
    print(word, '-', rootWord)

print()

# Shows the various synsets for each inflected form
for word in wordList:
    syns = wn.synsets(word)
    print(word, '-', syns)

dogs - dog
dogged - dog
dogging - dog

dogs - [Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01'), Synset('chase.v.01')]
dogged - [Synset('chase.v.01'), Synset('dogged.s.01')]
dogging - [Synset('chase.v.01'), Synset('dogging.s.01')]
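The rule-plus-exception approach described above can be sketched in a few lines. Everything in this example is hypothetical and drastically simplified: `TOY_LEXICON`, `TOY_EXCEPTIONS`, and `TOY_RULES` are hand-written stand-ins, not WordNet's real lexicon, exception files, or detachment rules.

```python
# Hypothetical mini-lexicon, exception list, and suffix rules (illustration only).
TOY_LEXICON = {"dog", "church", "party", "run"}
TOY_EXCEPTIONS = {"dogged": "dog", "dogging": "dog", "ran": "run"}
TOY_RULES = [("ies", "y"), ("es", ""), ("s", ""), ("ed", ""), ("ing", "")]


def toy_morphy(word):
    """Return a base form found in TOY_LEXICON, or None if there is none."""
    if word in TOY_EXCEPTIONS:  # irregular forms are looked up directly
        return TOY_EXCEPTIONS[word]
    if word in TOY_LEXICON:  # already a base form
        return word
    for suffix, replacement in TOY_RULES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            if candidate in TOY_LEXICON:  # accept only candidates that are known words
                return candidate
    return None


for form in ["dogs", "dogged", "churches", "parties", "ran", "xyzzy"]:
    print(form, "-", toy_morphy(form))
```

Checking each candidate against the lexicon is what keeps a blind suffix rule from inventing roots such as "dogg" for "dogged".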


## Wu-Palmer Similarity Metric

There are a few different approaches to measuring the semantic similarity of two words. NLTK's WordNet interface provides one such metric, called Wu-Palmer similarity. The Wu-Palmer metric measures the similarity of two synsets by comparing their depths in the WordNet taxonomy with the depth of their least distant (most specific) common parent node.

`synset1.wup_similarity(synset2)`

`synset1`, `synset2`: The two synsets that are being compared

*returns*: A float of at most 1, where higher values mean greater similarity and identical synsets score 1

In [346]:
# Demonstrates high metric for 2 similar synsets
similarity1 = wn.synset('truck.n.01')
similarity2 = wn.synset('car.n.01')

# Demonstrates low metric for 2 synsets that aren't similar
difference1 = wn.synset('tiger.n.01')
difference2 = wn.synset('cicada.n.01')

print('Similarity Metric of', similarity1.name(), 'and', similarity2.name(), ':', similarity1.wup_similarity(similarity2))
print()
print('Similarity Metric of', difference1.name(), 'and', difference2.name(), ':', difference1.wup_similarity(difference2))

Similarity Metric of truck.n.01 and car.n.01 : 0.9166666666666666

Similarity Metric of tiger.n.01 and cicada.n.01 : 0.6
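A common way to state the metric is `2 * depth(lcs) / (depth(s1) + depth(s2))`, where `lcs` is the most specific common ancestor. The sketch below applies that formula to a tiny hand-made taxonomy; `TOY_PARENT` and `toy_wup` are hypothetical, each node has a single parent, and NLTK's own implementation handles details (such as multiple hypernym paths) that are ignored here, so the numbers will not match the WordNet values above.

```python
# Toy taxonomy: child -> parent (hypothetical, not WordNet data).
TOY_PARENT = {
    "entity": None,
    "vehicle": "entity",
    "car": "vehicle",
    "truck": "vehicle",
    "animal": "entity",
    "tiger": "animal",
}


def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = TOY_PARENT[node]
    return path


def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))


def toy_wup(a, b):
    """Wu-Palmer style score on the toy taxonomy."""
    ancestors_of_a = path_to_root(a)
    # Walking up from b, the first node shared with a is the deepest common ancestor.
    lcs = next(n for n in path_to_root(b) if n in ancestors_of_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))


print("car / truck:", toy_wup("car", "truck"))
print("car / tiger:", toy_wup("car", "tiger"))
```

Here `car` and `truck` meet at `vehicle`, one level below the root, while `car` and `tiger` only meet at the root, which is why the first pair scores higher.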


## Lesk Algorithm

NLTK also provides an implementation of the Lesk algorithm for word sense disambiguation. Given a context sentence and a target word, the algorithm picks the synset of the target word whose definition shares the most words with the context sentence. It works under the assumption that the words surrounding a target word are likely to be related to the intended sense of that word.

`lesk(sentence, target, pos)`

`sentence`: List of Strings - A list of the words in a context sentence.

`target`: String value - The target word that we want to run the algorithm on.

`pos`: *(Optional)* String - Part of speech constraint

*returns*: Synset - The synset of the target word whose definition overlaps most with the context sentence.

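In the context sentence below, the Lesk algorithm does a good job of identifying the correct usage of the word **"house"** to mean **"a gambling house"**. Since NLTK's implementation counts the words shared between the context sentence and each definition, the match most likely comes from **"the"** and **"house"** appearing in the definition of `house.n.09` rather than from **"wins"**.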

In [347]:
# lesk parameters
contSent = ['They', 'say', 'the', 'house', 'always', 'wins', '.']
tarWord = 'house'

# Alternate parameters that can be run by uncommenting the below variables
# contSent = ['The', 'Lucchese', 'family', 'was', 'involved', 'in', 'several', 'crimes', '.']
# tarWord = 'family'

# displays all synsets of the chosen word
for syn in wn.synsets(tarWord):
    print(syn, '-', syn.definition())
print()

leskVal = lesk(contSent, tarWord)

# Formats the context sentence list into an actual sentence string for cleaner display
# Words are joined with spaces; the final punctuation token is attached without a space
sentence = ' '.join(contSent[:-1]) + contSent[-1]

# Outputs formatted results
print('Context Sentence:', sentence, '\n')
print('Target Word:', tarWord, '\n')
# Displays the computed synset and definition of the target word used in the context of the sentence
print(leskVal, '-', leskVal.definition())

Synset('house.n.01') - a dwelling that serves as living quarters for one or more families
Synset('firm.n.01') - the members of a business organization that owns or operates one or more establishments
Synset('house.n.03') - the members of a religious community living together
Synset('house.n.04') - the audience gathered together in a theatre or cinema
Synset('house.n.05') - an official assembly having legislative powers
Synset('house.n.06') - aristocratic family line
Synset('house.n.07') - play in which children take the roles of father or mother or children and pretend to interact like adults
Synset('sign_of_the_zodiac.n.01') - (astrology) one of 12 equal areas into which the zodiac is divided
Synset('house.n.09') - the management of a gambling house or casino
Synset('family.n.01') - a social unit living together
Synset('theater.n.01') - a building where theatrical performances or motion-picture shows can be presented
Synset('house.n.12') - a building in which something is sheltered or

Using the parameters below, however, we can see that the algorithm has its limits. Here, the word **"car"** is identified as `car.n.04` (**"where passengers ride up and down"**), whereas my goal was to use **"car"** to mean **"cable car"**. Although `cable_car.n.01` is a synset for **"car"** and the sentence even includes the word **"cable"** for additional context, `lesk()` identifies the usage to be `car.n.04`. I believe this is most likely because `lesk()` only counts the words shared between the context sentence and each definition, without regard to their position: **"up"** matches the definition of `car.n.04` just as **"cable"** matches the definition of `cable_car.n.01`, so the extra context gives the intended sense no advantage. While this result is not technically incorrect, it shows that the Lesk algorithm does not always use the context in the sentence as well as a human does.

In [348]:
# Alternate parameters
contSent = ['They', 'rode', 'the', 'car', 'up', 'the', 'cable', '.']
tarWord = 'car'

# displays all synsets of the chosen word
for syn in wn.synsets(tarWord):
    print(syn, '-', syn.definition())
print()

leskVal = lesk(contSent, tarWord)

# Formats the context sentence list into an actual sentence string for cleaner display
# Words are joined with spaces; the final punctuation token is attached without a space
sentence = ' '.join(contSent[:-1]) + contSent[-1]

# Outputs formatted results
print('Context Sentence:', sentence, '\n')
print('Target Word:', tarWord, '\n')
# Displays the computed synset and definition of the target word used in the context of the sentence
print(leskVal, '-', leskVal.definition())

Synset('car.n.01') - a motor vehicle with four wheels; usually propelled by an internal combustion engine
Synset('car.n.02') - a wheeled vehicle adapted to the rails of railroad
Synset('car.n.03') - the compartment that is suspended from an airship and that carries personnel and the cargo and the power plant
Synset('car.n.04') - where passengers ride up and down
Synset('cable_car.n.01') - a conveyance for passengers or freight on a cable railway

Context Sentence: They rode the car up the cable. 

Target Word: car 

Synset('car.n.04') - where passengers ride up and down
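The overlap counting behind this behavior can be sketched without WordNet. The example below is a simplified illustration, not NLTK's code: `TOY_DEFINITIONS` holds three of the definitions printed above, and `overlap` and `toy_lesk` are hypothetical helpers. Ties are broken by dictionary order here, which is an arbitrary choice of this sketch.

```python
# Three of the definitions printed above, keyed by their synset names.
TOY_DEFINITIONS = {
    "car.n.01": "a motor vehicle with four wheels; usually propelled by an internal combustion engine",
    "car.n.04": "where passengers ride up and down",
    "cable_car.n.01": "a conveyance for passengers or freight on a cable railway",
}


def overlap(context, definition):
    """Number of distinct tokens shared by the context and the definition."""
    return len(set(context) & set(definition.split()))


def toy_lesk(context, definitions):
    """Pick the sense whose definition shares the most tokens with the context."""
    # max() keeps the first sense it sees when several senses tie.
    return max(definitions, key=lambda sense: overlap(context, definitions[sense]))


toyContext = ['They', 'rode', 'the', 'car', 'up', 'the', 'cable', '.']
for sense, definition in TOY_DEFINITIONS.items():
    print(sense, '- overlap:', overlap(toyContext, definition))
print('Chosen:', toy_lesk(toyContext, TOY_DEFINITIONS))
```

With this context, `car.n.04` and `cable_car.n.01` each share exactly one token with the sentence (**"up"** and **"cable"** respectively), so the counts alone cannot separate the two senses.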


## SentiWordNet

SentiWordNet is a lexical resource for sentiment analysis. For each synset from WordNet, SentiWordNet assigns a set of sentiment scores: positivity, negativity, and objectivity. After downloading the `sentiwordnet` corpus, NLTK can be used to access SentiWordNet. By using WordNet to determine how the words in a piece of text are used, SentiWordNet can then be used to estimate that text's sentiment. This lets software understand a text beyond the literal meanings of its words.

### Finding SentiSynsets For a Word

`senti_synsets(word, pos)`

`word`: String - The chosen word

`pos`: *(Optional)* String - Part of speech constraint

*returns*: Filter **(Must be Typecast to `list`)** - The sentisynsets for the chosen word

In [349]:
word = 'horrific'
sentiSynList = list(swn.senti_synsets(word))

# Displays all sentisynsets for word and their scores
for sSyn in sentiSynList:
    # Formatted Output for current sentisynset
    print('%s\'s Scores:\tPositivity:\t' % sSyn.synset.name(), sSyn.pos_score())
    print('\t\t\tNegativity:\t', sSyn.neg_score())
    print('\t\t\tObjectivity:\t', sSyn.obj_score(),'\n')

hideous.s.01's Scores:	Positivity:	 0.0
			Negativity:	 0.875
			Objectivity:	 0.125 

awful.s.02's Scores:	Positivity:	 0.0
			Negativity:	 0.625
			Objectivity:	 0.375 



### Finding Polarity Scores For Sentences

The example sentence is very positive, which is reflected in its high positivity score. Curiously, however, the sentence also receives a negative score even though a human reader would find no negativity in it. To show where this comes from, the sentisynset chosen for each token is also logged. The log reveals some odd choices. The word "I", referring to the speaker, is tagged as `iodine.n.01`. Similarly, "it", referring to the restaurant, is tagged as `information_technology.n.01`. While neither of these affects the result of the polarity calculation, they show the potential issues with using SentiWordNet without additional processing such as part-of-speech tagging or word sense disambiguation. The negativity comes from the word "is", which is tagged as `be.v.01` and, according to SentiWordNet, has a slightly negative sentiment.

In [350]:
sentence = 'I love that restaurant it is delicious'
# Processes the sentence into a list of string tokens that each contain a word from the sentence
tokens = sentence.split()

posScore = 0
negScore = 0

for currTok in tokens:
    currSynList = list(swn.senti_synsets(currTok))
    if currSynList:
        # Uses only the first sentisynset in the list; the other senses are ignored
        sSyn = currSynList[0]
        negScore += sSyn.neg_score()
        posScore += sSyn.pos_score()

        #Logging
        print(currTok, '-', sSyn)

print('\nSentence:', sentence)
print('Polarity scores:\tPositive:', posScore)
print('\t\t\tNegative:', negScore)

I - <iodine.n.01: PosScore=0.0 NegScore=0.0>
love - <love.n.01: PosScore=0.625 NegScore=0.0>
restaurant - <restaurant.n.01: PosScore=0.0 NegScore=0.0>
it - <information_technology.n.01: PosScore=0.0 NegScore=0.0>
is - <be.v.01: PosScore=0.25 NegScore=0.125>
delicious - <delicious.n.01: PosScore=0.0 NegScore=0.0>

Sentence: I love that restaurant it is delicious
Polarity scores:	Positive: 0.875
			Negative: 0.125
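The totals above follow directly from the logged scores, and the arithmetic can be reproduced without SentiWordNet. In this sketch the `(positive, negative)` pairs in `TOY_SCORES` are copied from the log above, while `TOY_STOP_WORDS` and `toy_polarity` are hypothetical additions that show how skipping function words would change the sums.

```python
# (positive, negative) scores copied from the logged output above.
# "that" has no entry, mirroring the fact that it was not logged.
TOY_SCORES = {
    "I": (0.0, 0.0),
    "love": (0.625, 0.0),
    "restaurant": (0.0, 0.0),
    "it": (0.0, 0.0),
    "is": (0.25, 0.125),
    "delicious": (0.0, 0.0),
}
# Hypothetical, hand-picked stop-word list for this sentence only.
TOY_STOP_WORDS = {"I", "that", "it", "is"}


def toy_polarity(tokens, skip=frozenset()):
    """Sum positive and negative scores of the tokens that have scores and are not skipped."""
    scored = [TOY_SCORES[t] for t in tokens if t in TOY_SCORES and t not in skip]
    return sum(p for p, _ in scored), sum(n for _, n in scored)


toyTokens = 'I love that restaurant it is delicious'.split()
print('All tokens:', toy_polarity(toyTokens))
print('Without stop words:', toy_polarity(toyTokens, skip=TOY_STOP_WORDS))
```

Dropping "is" removes the only source of negativity, but it also removes 0.25 of the positive score, so filtering is a trade-off rather than a pure fix.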


## Collocations

Collocations are combinations of words that carry a different or richer meaning than their individual words do. For example, **"hot dog"** has a different meaning than either of its two parts. Such collocations appear often enough that they can usually be detected. Since unrelated words are less likely to occur in that specific order by chance than the collocation is to occur, it can usually be assumed that these combinations of words carry the alternate meaning when they are found. Recognizing these common meanings can be useful for language processing since normal human speech uses many such groupings of words.

`text.collocations()`

`text`: NLTK Text object - The text object that is being searched to find collocations

*returns*: Nothing - The collocations found in the text are printed

In [351]:
text4.collocations()

United States; fellow citizens; years ago; four years; Federal
Government; General Government; American people; Vice President; God
bless; Chief Justice; one another; fellow Americans; Old World;
Almighty God; Fellow citizens; Chief Magistrate; every citizen; Indian
tribes; public debt; foreign nations


### Mutual Information

Mutual information is a measure of the mutual dependence between two random variables. For a collocation, the (pointwise) mutual information compares how often the two words actually occur together with how often they would be expected to occur together if they were independent. This can be calculated with the formula: `MI(word1, word2) = log2(P(word1, word2) / (P(word1) * P(word2)))`

Here, I chose to calculate the mutual information of the collocation **"fellow citizens"**. One thing I noticed was that NLTK identified **"fellow citizens"** and **"Fellow citizens"** as two separate collocations due to the capitalization. To get around this, I made the full text lowercase. A mutual information measurement of 4.13206 is fairly high compared to pairs of words that do not form a new meaning, indicating that this word combination is a collocation.

In [352]:
# Designed so only collocation variable needs to be changed
bothWords = 'fellow citizens'
word1,word2 = bothWords.split(' ')

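# The tokens are made into a single string so consecutive words can be searched for
# lowercase adjusts for any differences in capitalization during the counting process
# Note: str.count() matches substrings, so 'fellow' is also counted inside 'fellows'
fullText = ' '.join(text4.tokens).lower()

# Number of unique tokens in the text, used here as the normalizing constant
# (len(text4) would instead give the total number of tokens)
numUniToks = len(set(text4))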
pBothWords = fullText.count(bothWords) / numUniToks
pWord1 = fullText.count(word1) / numUniToks
pWord2 = fullText.count(word2) / numUniToks

mutInfo = math.log2(pBothWords / (pWord1 * pWord2))

print('Mutual information of \"%s\": %1.5f' % (bothWords, mutInfo))

Mutual information of "fellow citizens": 4.13206


As seen below, the mutual information of the non-collocation **"the one"** is significantly lower than that of the collocation above. In fact, a negative MI measurement indicates that the two words occur in this order even less often in this text than would be expected by random chance. The clear disparity between the MI values of the two word combinations shows that the words of a collocation occur together far more often than chance would predict, so when such a combination is found, it is reasonable to interpret it with the meaning of the collocation instead of the meanings of the individual words.

In [354]:
# two unrelated words next to each other
bothWords = 'the one'
word1,word2 = bothWords.split(' ')

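# The tokens are made into a single string so consecutive words can be searched for
# lowercase adjusts for any differences in capitalization during the counting process
# Note: str.count() matches substrings, so 'the' is also counted inside 'there'
fullText = ' '.join(text4.tokens).lower()

# Number of unique tokens in the text, used here as the normalizing constant
# (len(text4) would instead give the total number of tokens)
numUniToks = len(set(text4))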
pBothWords = fullText.count(bothWords) / numUniToks
pWord1 = fullText.count(word1) / numUniToks
pWord2 = fullText.count(word2) / numUniToks

mutInfo = math.log2(pBothWords / (pWord1 * pWord2))

print('Mutual information of \"%s\": %1.5f' % (bothWords, mutInfo))

Mutual information of "the one": -5.45581
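The cells above estimate the probabilities from substring counts on one long string. As an alternative sketch, the same formula can be applied to whole tokens and adjacent token pairs. The function `toy_pmi` and the short token list `toyTokens` are made up for this illustration, and the resulting numbers are not comparable to the values printed above because the counting differs.

```python
import math
from collections import Counter


def toy_pmi(tokens, word1, word2):
    """Pointwise mutual information (log base 2) of the adjacent pair word1 word2."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))  # adjacent token pairs
    pairCount = bigrams[(word1, word2)]
    if pairCount == 0:
        return float('-inf')  # the pair never occurs, so the log is undefined
    pBoth = pairCount / (len(tokens) - 1)  # relative frequency among all adjacent pairs
    pWord1 = unigrams[word1] / len(tokens)
    pWord2 = unigrams[word2] / len(tokens)
    return math.log2(pBoth / (pWord1 * pWord2))


# Made-up, already lowercased token list for illustration.
toyTokens = ('my fellow citizens we the people thank our '
             'fellow citizens and the nation').split()

print('fellow citizens: %1.5f' % toy_pmi(toyTokens, 'fellow', 'citizens'))
print('people fellow:', toy_pmi(toyTokens, 'people', 'fellow'))
```

The same function could in principle be called on a lowercased copy of `text4.tokens`; counting whole tokens avoids matches inside longer words, and dividing by the total number of tokens gives actual relative frequencies.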
