# **1**
## **WordNet**
WordNet is a hierarchical organization of nouns, verbs, adjectives and adverbs, grouped into sets of synonyms (synsets).
It was developed at Princeton University.
We can use it to look up a word's synsets, definitions, examples and relations to other words (hypernyms, hyponyms, meronyms, holonyms, troponyms, etc.).

# **2**


In [None]:
# choosing a noun to work with
from nltk.corpus import wordnet
wordnet.synsets('way')

# **3**

In [None]:
# extracting definition from one of the synsets
wordnet.synset('direction.n.01').definition()

In [None]:
# getting example
wordnet.synset('direction.n.01').examples()

In [None]:
# getting lemma
wordnet.synset('direction.n.01').lemmas()

In [None]:
# traversing the hierarchy
word_hyp = wordnet.synset('direction.n.01')
hyper = lambda s: s.hypernyms()
list(word_hyp.closure(hyper))

## **Observation**
In the output we can see that the further up the hierarchy we go, the more general the words become.
For example: lion -> wildcat -> carnivore -> mammal -> animal, and so on.
This makes sense, because each hypernym is a broader term that covers the word below it.
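
To make the idea of walking up the hierarchy concrete, here is a minimal sketch that does not use WordNet at all. The table `TOY_HYPERNYMS` and the function `hypernym_closure` are hypothetical names, and the entries are hand-written for illustration rather than taken from WordNet.

```python
# Hand-written toy hierarchy (an assumption for illustration, not WordNet data).
TOY_HYPERNYMS = {
    "lion": ["wildcat"],
    "wildcat": ["carnivore"],
    "carnivore": ["mammal"],
    "mammal": ["animal"],
    "animal": [],
}


def hypernym_closure(word, table):
    """Return every ancestor of `word`, nearest first (breadth-first walk)."""
    seen = []
    queue = list(table.get(word, []))
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue  # skip nodes that were already reached by another path
        seen.append(current)
        queue.extend(table.get(current, []))
    return seen


chain = hypernym_closure("lion", TOY_HYPERNYMS)
print(" -> ".join(["lion"] + chain))
```

Each step replaces a word with its more general parent, which is the same pattern that `closure` follows over real synsets.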

# **4**

In [None]:
wordnet.synset('direction.n.01').hypernyms()

In [None]:
wordnet.synset('direction.n.01').hyponyms()

In [None]:
wordnet.synset('direction.n.01').part_meronyms()

In [None]:
wordnet.synset('direction.n.01').part_holonyms()

In [None]:
antonyms = []
lemmatized = wordnet.synset('direction.n.01').lemmas()
for x in lemmatized:
    if x.antonyms():
        antonyms.append(x.antonyms()[0].name())

print(antonyms)

# **5**

In [None]:
# synset of verb
wordnet.synsets('study')

# **6**

In [None]:
# extracting definition from one of the synsets
wordnet.synset('analyze.v.01').definition()

In [None]:
# getting example
wordnet.synset('analyze.v.01').examples()

In [None]:
# getting lemma
wordnet.synset('analyze.v.01').lemmas()

In [None]:
# traversing the hierarchy
word_hyp = wordnet.synset('analyze.v.01')
hyper = lambda s: s.hypernyms()
list(word_hyp.closure(hyper))

## **Observation**
In the output we don't get any hierarchical data to draw conclusions from.
A hierarchy usually looks like lion -> wildcat -> carnivore -> mammal -> animal, and so on.
From this we can see that not all verbs have hypernyms.

# **7**

In [None]:
# using morphy to find the base form of 'analyze' as NOUN, VERB, ADJ and ADV
wordnet.morphy('analyze', wordnet.NOUN)

In [None]:
wordnet.morphy('analyze', wordnet.VERB)

In [None]:
wordnet.morphy('analyze', wordnet.ADJ)

In [None]:
wordnet.morphy('analyze', wordnet.ADV)

# **8**

In [None]:
# finding two synsets from two words to compare later
wordnet.synsets('bus')

In [None]:
wordnet.synsets('ship')

In [None]:
# finding similarity using Wu-Palmer

bus = wordnet.synset('bus.n.01')
ship = wordnet.synset('ship.n.01')
wordnet.wup_similarity(bus, ship)
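
The Wu-Palmer score relates the depth of the two concepts to the depth of their deepest shared ancestor: `2 * depth(lcs) / (depth(a) + depth(b))`. The sketch below computes it on a tiny hand-written tree. `TOY_PARENT`, `path_to_root`, `depth` and `wu_palmer` are hypothetical names, the tree is an assumption for illustration, and every node has one parent, which is not true of WordNet in general. The number it gives will therefore not match the WordNet result above.

```python
# Hand-written single-parent taxonomy (an assumption, not WordNet data).
TOY_PARENT = {
    "entity": None,
    "vehicle": "entity",
    "wheeled_vehicle": "vehicle",
    "vessel": "vehicle",
    "bus": "wheeled_vehicle",
    "ship": "vessel",
}


def path_to_root(node, parent):
    """Return the list [node, parent, grandparent, ..., root]."""
    path = [node]
    while parent[node] is not None:
        node = parent[node]
        path.append(node)
    return path


def depth(node, parent):
    """Depth of a node, counting the root as depth 1."""
    return len(path_to_root(node, parent))


def wu_palmer(a, b, parent):
    """Wu-Palmer similarity on a single-parent tree."""
    ancestors_b = set(path_to_root(b, parent))
    # The first ancestor of `a` that is also an ancestor of `b` is the deepest one.
    lcs = next(n for n in path_to_root(a, parent) if n in ancestors_b)
    return 2 * depth(lcs, parent) / (depth(a, parent) + depth(b, parent))


print(wu_palmer("bus", "ship", TOY_PARENT))
```

In this toy tree the two words meet at "vehicle", so the score sits between 0 and 1; identical concepts score 1.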

In [None]:
from nltk.wsd import lesk
from nltk import word_tokenize

# applying the Lesk algorithm
bus_sentence = word_tokenize(wordnet.synset('bus.n.01').examples()[0])
bus_sentence_string =' '.join(bus_sentence)
print(f'Given sentence: {bus_sentence_string}')
print(lesk(bus_sentence, 'bus', 'n'))

ship_sentence = ['they', 'booked', 'a', 'ticket', 'on', 'the', 'cruise', 'ship']
ship_sentence_string =' '.join(ship_sentence)
print(f'Given sentence: {ship_sentence_string}')
print(lesk(ship_sentence, 'ship', 'n'))

## **Observation**
For 'bus', the output is not the correct meaning for the context: the algorithm assumes we mean bus as a network topology instead of the vehicle.
For 'ship', the output is the correct meaning for the context: it returns the noun sense of ship.
From this we can see that the Lesk algorithm is not always right, but it is usually able to get the meaning from context as long as the relevant information is available to it.
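
One way to see why the 'bus' result can go wrong is a simplified, gloss-overlap variant of Lesk: pick the sense whose definition shares the most words with the sentence. The sketch below is not the NLTK implementation. `TOY_SENSES`, `STOP` and `simplified_lesk` are hypothetical names, and the two glosses are hand-written for illustration.

```python
# Hand-written glosses for two senses of "bus" (assumptions, not WordNet text).
TOY_SENSES = {
    "bus_vehicle": "a vehicle carrying many passengers used for public transport",
    "bus_topology": "the topology of a network whose components are connected by a busbar",
}

# A tiny stop-word list, also an assumption for this example.
STOP = frozenset({"the", "of", "a", "to"})


def simplified_lesk(context_tokens, senses, stopwords=frozenset()):
    """Return (sense, overlap) for the gloss sharing the most words with the context."""
    context = {t.lower() for t in context_tokens} - stopwords
    best_sense, best_overlap = None, -1
    for sense, gloss in senses.items():
        gloss_words = set(gloss.lower().split()) - stopwords
        overlap = len(context & gloss_words)
        if overlap > best_overlap:  # ties keep the earlier sense
            best_sense, best_overlap = sense, overlap
    return best_sense, best_overlap


# Without stop-word filtering, the only shared word is "the", so the wrong sense wins.
print(simplified_lesk("he always rode the bus to work".split(), TOY_SENSES))
# With filtering and a more informative context, the vehicle sense wins.
print(simplified_lesk("the bus was full of passengers".split(), TOY_SENSES, STOP))
```

In this toy setup a single shared function word is enough to tip the decision, which suggests why short example sentences are hard for overlap-based methods.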

# **9**
## **SentiWordNet**
SentiWordNet is built on top of WordNet and can be used for sentiment analysis.
It assigns positivity, negativity and objectivity scores to each synset, which we can look up for the words in a sentence.
This can be used to assess user input and what the user is trying to express.
It is especially helpful for voice assistants and AI in general.

In [None]:
from nltk.corpus import sentiwordnet

# selecting an emotionally charged word
# (another candidate to try: 'compassion.n.01')
print(wordnet.synsets('affection'))
affection = sentiwordnet.senti_synset('affection.n.01')
print(affection)

In [None]:
senti_sent = "I enjoy spending time with my family".split()
#senti_sent = "I enjoy spending time family".split()
for word in senti_sent:
    syn_list = list(sentiwordnet.senti_synsets(word))
    if len(syn_list) > 0 :
        print(f'Polarity of \"{word}\" : {syn_list[0]}')
    else:
        print(f'Sorry polarity of \"{word}\" can not be determined as there is no senti_synset for it')



## **Observation**
In the output we can see that polarity is not available for every word, because some words have no senti_synset.
It may also give wrong answers because it makes assumptions: here "I" was meant as the speaker, but the function assumed it meant iodine.
For emotionally charged words like "enjoy", however, we get the polarity fairly accurately.
As mentioned before, this can be used in AI, voice assistants and more. If I were talking to an AI and said this sentence to it,
I'd expect it to reply in a positive manner. If I said "I am feeling down", it should reply in an encouraging manner.
The AI can only work out the proper way to respond if it is able to use sentiment analysis well.
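
As a rough illustration of how per-word scores could be combined into a decision about the whole sentence, the sketch below sums positive and negative scores and compares them. `TOY_SCORES` and `sentence_polarity` are hypothetical names, and the score values are made up for this example rather than read from SentiWordNet.

```python
# word -> (positive score, negative score); made-up values, not SentiWordNet data.
TOY_SCORES = {
    "enjoy": (0.375, 0.0),
    "family": (0.0, 0.0),
    "down": (0.0, 0.25),
}


def sentence_polarity(tokens, scores):
    """Return (label, positive minus negative, words with no score)."""
    pos = neg = 0.0
    unknown = []
    for tok in tokens:
        key = tok.lower()
        if key in scores:
            p, n = scores[key]
            pos += p
            neg += n
        else:
            unknown.append(tok)  # mirrors the "no senti_synset" case
    if pos > neg:
        label = "positive"
    elif neg > pos:
        label = "negative"
    else:
        label = "neutral"
    return label, pos - neg, unknown


print(sentence_polarity("I enjoy spending time with my family".split(), TOY_SCORES))
print(sentence_polarity("I am feeling down".split(), TOY_SCORES))
```

A plain sum ignores negation and word sense, so this is only a starting point for deciding how an assistant should respond.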

# **10**
## **Collocation**
Many common sets of words are highly likely to be found right next to each other.
Some general examples are "social media", "desktop computer" and "artificial intelligence".
With this knowledge we can predict which word is likely to come next.

In [None]:
from nltk.book import text4

# collocations() prints its result itself and returns None, so no print() is needed
text4.collocations()

In [None]:
from collections import Counter
from nltk import bigrams
import math

# Calculating mutual information of "years ago"
count_bigram = 0
bi_grams = list(bigrams(text4))
for freq in bi_grams:
    if freq[0] == 'years' and freq[1] == 'ago':
        count_bigram = count_bigram + 1

count_one = 0
count_two = 0
for freq in text4:
    if freq == 'years':
        count_one = count_one + 1
    if freq == 'ago':
        count_two = count_two + 1


print(f'\"years ago\" appears: {count_bigram} times')
print(f'tokens in \"text4\" with no preprocessing: {len(text4)}')
print(f'\"years\" appears: {count_one} times')
print(f'\"ago\" appears: {count_two} times')

n_tokens = len(text4)
print('Formula for PMI is:')
print('\tlog2(P(x,y) / (P(x) * P(y)))')
print(f'\tlog2(({count_bigram}/({n_tokens}-1)) / (({count_one}/{n_tokens}) * ({count_two}/{n_tokens})))')
pxy = count_bigram / (n_tokens - 1)  # P(x,y): bigram count over the number of bigrams
px = count_one / n_tokens            # P(x)
py = count_two / n_tokens            # P(y)
print(f'\tlog2({pxy} / ({px} * {py}))')
pxpy = px * py
print(f'\tlog2({pxy} / {pxpy})')
div = pxy / pxpy
print(f'\tlog2({div})')
ans = round(math.log2(div), 2)
print(f'Answer is : {ans:.2f}')
if ans > 0:
    print('It is likely to be a collocation')
elif ans == 0:
    print('It is likely both words are independent')
else:
    print('It is NOT likely to be a collocation')

## **Observation**
The chosen bigram occurs often in the text, so it is likely to be a collocation. The PMI we calculated is positive, which supports this: the two words appear together more often than we would expect if they were independent.
The calculation was done on text4, which is the Inaugural corpus. Just from the context we could guess that this bigram is likely to occur, and now we have data to support it.
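
The same calculation can be wrapped in a small reusable function. This is a sketch on a made-up token list rather than on text4. `pmi` and `toy_tokens` are hypothetical names, and it uses the same probability estimates as the cell above (bigram count over the number of bigrams, word counts over the number of tokens).

```python
import math
from collections import Counter


def pmi(tokens, first, second):
    """Pointwise mutual information of the bigram (first, second) in a token list."""
    unigrams = Counter(tokens)
    pairs = Counter(zip(tokens, tokens[1:]))
    p_xy = pairs[(first, second)] / (len(tokens) - 1)
    if p_xy == 0:
        return float("-inf")  # the bigram never occurs
    p_x = unigrams[first] / len(tokens)
    p_y = unigrams[second] / len(tokens)
    return math.log2(p_xy / (p_x * p_y))


# Made-up sentence for illustration only.
toy_tokens = "four years ago we met and two years ago we parted".split()
print(f'PMI of "years ago": {pmi(toy_tokens, "years", "ago"):.2f}')
print(f'PMI of "ago years": {pmi(toy_tokens, "ago", "years")}')
```

A positive value means the pair occurs together more often than independence would predict; on very small samples the estimate is noisy, so it is best read alongside the raw counts.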