# Intro to WordNet functionality

While local context matters a great deal in understanding what words mean in a corpus, words also have ‘global’, commonly known, context-free meanings.

For instance, if we hear the adjective “great” with no context at all, we still assume some meaning for it (i.e. we assign it a context-free meaning).

A repository of such meanings (and of other semantic information about words) is a dictionary.

Imagine we could access and query an English dictionary at will from inside Python.

We could apply the dictionary to text patterns we see in the corpus in order to obtain greater meaning.

In what follows, we will query Princeton’s WordNet lexical database from inside Python, first via NLTK and then via spaCy.


### Setup Chunk

In [1]:
import nltk
import pandas as pd
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\31202\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


True

In [2]:
from nltk.corpus import wordnet as wn  # needs the 'wordnet' corpus fetched by nltk.download('wordnet') above

# querying for simple synonyms and antonyms
antonyms=[]
synonyms=[]
for syn in wn.synsets("good"): 
    for l in syn.lemmas(): 
        synonyms.append(l.name()) 
        if l.antonyms(): 
            antonyms.append(l.antonyms()[0].name()) 

print("synonyms of good are:\n", synonyms)
print("="*20, "\n")
print("antonyms of good are:\n", antonyms)


synonyms of good are:
 ['good', 'good', 'goodness', 'good', 'goodness', 'commodity', 'trade_good', 'good', 'good', 'full', 'good', 'good', 'estimable', 'good', 'honorable', 'respectable', 'beneficial', 'good', 'good', 'good', 'just', 'upright', 'adept', 'expert', 'good', 'practiced', 'proficient', 'skillful', 'skilful', 'good', 'dear', 'good', 'near', 'dependable', 'good', 'safe', 'secure', 'good', 'right', 'ripe', 'good', 'well', 'effective', 'good', 'in_effect', 'in_force', 'good', 'good', 'serious', 'good', 'sound', 'good', 'salutary', 'good', 'honest', 'good', 'undecomposed', 'unspoiled', 'unspoilt', 'good', 'well', 'good', 'thoroughly', 'soundly', 'good']

antonyms of good are:
 ['evil', 'evilness', 'bad', 'badness', 'bad', 'evil', 'ill']
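
The synonym list above repeats 'good' many times, because every synset of the word contributes its own lemmas. A minimal sketch of the same loop with duplicates removed follows. The function name `collect_synonyms_antonyms` is illustrative and not part of NLTK, and unlike the cell above it keeps every antonym of a lemma rather than only the first. It relies only on the `lemmas()`, `name()` and `antonyms()` calls already used above.

```python
def collect_synonyms_antonyms(synsets):
    """Gather lemma names and antonym names, dropping repeats.

    `synsets` is any iterable of WordNet synsets, e.g. wn.synsets("good").
    A dict is used as an ordered set, so first-seen order is preserved.
    """
    synonyms, antonyms = {}, {}
    for syn in synsets:
        for lemma in syn.lemmas():
            synonyms[lemma.name()] = None
            for ant in lemma.antonyms():
                antonyms[ant.name()] = None
    return list(synonyms), list(antonyms)


# Usage (assumes the setup chunk above has been run):
#   syns, ants = collect_synonyms_antonyms(wn.synsets("good"))
```

Passing the synsets in as an argument keeps the helper independent of which word is being queried.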


### Intro to synsets

Synsets, or synonym sets, are the basic unit of WordNet queries.

Each synset groups the words (lemmas) that share one particular sense, so a word with several senses belongs to several synsets.

Below, we see the 5 synsets, i.e. 5 different senses, that WordNet lists for 'car' in English.


In [3]:
# find synsets for the word 'car'
car_synsets = wn.synsets('car')
car_synsets

[Synset('car.n.01'),
 Synset('car.n.02'),
 Synset('car.n.03'),
 Synset('car.n.04'),
 Synset('cable_car.n.01')]
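
Each name in the output above appears to follow a `lemma.pos.number` pattern: a head lemma, a part-of-speech letter (`n` for noun) and a sense number. As a small illustration, the helper below (a hypothetical name, `split_synset_name`, not an NLTK function) splits such a name into its three parts using plain string handling.

```python
def split_synset_name(name):
    """Split a synset name such as 'car.n.01' into (lemma, pos, sense_number).

    Splitting from the right keeps a lemma that itself contains dots intact.
    """
    lemma, pos, number = name.rsplit(".", 2)
    return lemma, pos, int(number)


# Two of the names shown in the output above
for name in ["car.n.01", "cable_car.n.01"]:
    print(split_synset_name(name))
```

With NLTK loaded, the same string is available from a synset object via `synset.name()`.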

There is more we can do in NLP with queryable dictionaries at our fingertips.

For instance, we could dig deeper and list the lemmas for each sense in which a word (say, 'car') appears.

Or find the more abstract or general form (the 'hypernym') of a word (e.g., 'car' -> 'motor vehicle').

Or the more specific forms of the word, i.e. its hyponyms (e.g., 'car' -> 'ambulance', 'baggage car' etc.).

Or find the part-of-speech (POS) tag, etc. Behold below.


In [4]:
# empty lists to populate
synset_id=[]; lemma=[]; defn=[]; example=[]; 
hypernyms=[]; hyponyms=[]; root_hypernym=[]; pos=[]


In [5]:
# loop over every synset (i.e., every sense) of 'car' in WordNet
for car in car_synsets:
    synset_id.append(car)
    lemma.append(str(car.lemmas()))
    defn.append(str(car.definition()))
    example.append(str(car.examples()))
    hypernyms.append(str(car.hypernyms()))
    hyponyms.append(str(car.hyponyms()))
    root_hypernym.append(str(car.root_hypernyms()))
    pos.append(str(car.pos()))


In [6]:
# store in DF for good display    
wn_df = pd.DataFrame({'synset_id':synset_id, 'lemma':lemma, 'defn':defn, 
                      'example':example, 'hypernyms':hypernyms, 'hyponyms':hyponyms,
                      'root_hypernym':root_hypernym, 'pos':pos})    


In [7]:
# display results
print(wn_df.shape)  # (rows, columns): one row per synset, one column per attribute
wn_df.T  # transposed so that each synset reads as a column

(5, 8)


| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| synset_id | Synset('car.n.01') | Synset('car.n.02') | Synset('car.n.03') | Synset('car.n.04') | Synset('cable_car.n.01') |
| lemma | [Lemma('car.n.01.car'), Lemma('car.n.01.auto')... | [Lemma('car.n.02.car'), Lemma('car.n.02.railca... | [Lemma('car.n.03.car'), Lemma('car.n.03.gondol... | [Lemma('car.n.04.car'), Lemma('car.n.04.elevat... | [Lemma('cable_car.n.01.cable_car'), Lemma('cab... |
| defn | a motor vehicle with four wheels; usually prop... | a wheeled vehicle adapted to the rails of rail... | the compartment that is suspended from an airs... | where passengers ride up and down | a conveyance for passengers or freight on a ca... |
| example | ['he needs a car to get to work'] | ['three cars had jumped the rails'] | [] | ['the car was on the top floor'] | ['they took a cable car to the top of the moun... |
| hypernyms | [Synset('motor_vehicle.n.01')] | [Synset('wheeled_vehicle.n.01')] | [Synset('compartment.n.02')] | [Synset('compartment.n.02')] | [Synset('compartment.n.02')] |
| hyponyms | [Synset('ambulance.n.01'), Synset('beach_wagon... | [Synset('baggage_car.n.01'), Synset('cabin_car... | [] | [] | [] |
| root_hypernym | [Synset('entity.n.01')] | [Synset('entity.n.01')] | [Synset('entity.n.01')] | [Synset('entity.n.01')] | [Synset('entity.n.01')] |
| pos | n | n | n | n | n |
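
The root_hypernym row shows that every sense of 'car' eventually rolls up to `entity.n.01`. To see the intermediate steps, one option is to climb one hypernym at a time. The sketch below is only an illustration under stated assumptions: `hypernym_chain` is a made-up helper, and it always follows the first hypernym listed, although a synset can have more than one parent.

```python
def hypernym_chain(synset, max_depth=50):
    """Follow the first hypernym repeatedly and return the names visited.

    Uses only synset.name() and synset.hypernyms(), as in the cells above.
    max_depth is a safety cap so the loop always terminates.
    """
    chain = [synset.name()]
    current = synset
    for _ in range(max_depth):
        parents = current.hypernyms()
        if not parents:        # reached a root such as entity.n.01
            break
        current = parents[0]   # simplification: ignore any other parents
        chain.append(current.name())
    return chain


# Usage (assumes car_synsets from the cells above):
#   print(" -> ".join(hypernym_chain(car_synsets[0])))
```

NLTK synsets also provide `hypernym_paths()`, which returns complete paths to the root and may be preferable when a synset has several parents.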


One way to incorporate context around words is to specify the domain in which a word is used.

E.g., 'bank' can refer to a business, a river, a road etc., senses which come from domains such as "business", "geography", "construction" etc.

There is a long, pre-defined list of domains that we can look up and check against. See below for 'price'.

We do this in spaCy, so the setup follows.


### Setup Chunk for spaCy

In [8]:
!pip install spacy-wordnet
!pip install py_thesaurus

import spacy
from spacy_wordnet.wordnet_annotator import WordnetAnnotator

import en_core_web_sm
nlp = en_core_web_sm.load()


Collecting spacy-wordnet
  Downloading https://files.pythonhosted.org/packages/f7/f2/4d8070df0f7a7a9eeed74eb7e9ce3cf41349eb5e06b1e088de9eeca630e2/spacy-wordnet-0.0.4.tar.gz (648kB)
Collecting nltk<3.4,>=3.3
  Downloading https://files.pythonhosted.org/packages/50/09/3b1755d528ad9156ee7243d52aa5cd2b809ef053a0f31b53d92853dd653a/nltk-3.3.0.zip (1.4MB)
Building wheels for collected packages: spacy-wordnet, nltk
  Building wheel for spacy-wordnet (setup.py): started
  Building wheel for spacy-wordnet (setup.py): finished with status 'done'
  Created wheel for spacy-wordnet: filename=spacy_wordnet-0.0.4-py2.py3-none-any.whl size=650298 sha256=15676a36cc8c4ba4ffe6cd3e598d574fd70bfea87b7c4d3dc2ff02f42a3c517e
  Stored in directory: C:\Users\31202\AppData\Local\pip\Cache\wheels\25\93\1d\c86db913cd146fc9ddb26d10f56579c5d58a3e00bc8f96a3a6
  Building wheel for nltk (setup.py): started
  Building wheel for nltk (setup.py): finished with status 'done'
  Created wheel for nltk: filename=nltk-3.3-cp37-

In [9]:
# Load a spaCy model (supported models are "es" and "en")
# nlp = spacy.load('en')
nlp.add_pipe(WordnetAnnotator(nlp.lang), after='tagger')
token = nlp('prices')[0]


In [10]:
# the wordnet extension links a spaCy token to the NLTK WordNet interface, giving access to its synsets and lemmas
token._.wordnet.synsets()
print("===========\n")
token._.wordnet.lemmas()





[Lemma('monetary_value.n.01.monetary_value'),
 Lemma('monetary_value.n.01.price'),
 Lemma('monetary_value.n.01.cost'),
 Lemma('price.n.02.price'),
 Lemma('price.n.02.terms'),
 Lemma('price.n.02.damage'),
 Lemma('price.n.03.price'),
 Lemma('price.n.03.cost'),
 Lemma('price.n.03.toll'),
 Lemma('price.n.04.price'),
 Lemma('price.n.05.price'),
 Lemma('price.n.06.price'),
 Lemma('price.n.07.Price'),
 Lemma('price.n.07.Leontyne_Price'),
 Lemma('price.n.07.Mary_Leontyne_Price')]

In [11]:
# it also tags the token automatically with WordNet domains
token._.wordnet.wordnet_domains()


['book_keeping',
 'numismatics',
 'betting',
 'banking',
 'insurance',
 'racing',
 'social',
 'money',
 'finance',
 'post',
 'law',
 'commerce',
 'enterprise',
 'telegraphy',
 'mathematics',
 'industry',
 'economy',
 'tax',
 'free_time',
 'jewellery',
 'statistics',
 'exchange',
 'buildings',
 'diplomacy',
 'book_keeping',
 'factotum',
 'agriculture',
 'electrotechnology',
 'numismatics',
 'person',
 'telephony',
 'metrology',
 'politics',
 'betting',
 'banking',
 'sociology',
 'insurance',
 'racing',
 'publishing',
 'social',
 'money',
 'card',
 'finance',
 'post',
 'law',
 'topography',
 'tourism',
 'commerce',
 'philology',
 'telegraphy',
 'enterprise',
 'mathematics',
 'time_period',
 'town_planning',
 'animal_husbandry',
 'pure_science',
 'computer_science',
 'economy',
 'industry',
 'tax',
 'quality',
 'free_time',
 'philately',
 'railway',
 'jewellery',
 'telecommunication',
 'statistics',
 'exchange',
 'economy',
 'music']
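
The domain list above contains repeats ('economy', for one, shows up three times), presumably because several senses of the word carry the same domain tag. A rough way to shortlist candidates is to tally the repeats. The sketch below uses only the standard library; `rank_domains` is an illustrative name, and the sample list is a toy input, not real WordNet output.

```python
from collections import Counter


def rank_domains(domains):
    """Return (domain, count) pairs, most frequent first."""
    return Counter(domains).most_common()


# Toy input for illustration only; the counts are not real WordNet counts
sample = ["economy", "money", "finance", "economy", "banking", "money", "economy"]
for domain, count in rank_domains(sample):
    print(domain, count)
```

On the real data one would pass in the list returned by `token._.wordnet.wordnet_domains()`.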

Now we choose the subset of domains that are relevant and ignore the rest. See below.

In [12]:
# spacy-wordnet lets you find synonyms by domain of interest, for example the economy
sentence = nlp('I want to withdraw 5,000 rupees')
economy_domains = ['finance', 'banking']

In [13]:
token_with_synsets = [(token, token._.wordnet.wordnet_synsets_for_domain(economy_domains)) for token in sentence]
enriched_sentence = []  # empty list to populate


In [14]:
for token, synsets in token_with_synsets:
    
    if not synsets:
        enriched_sentence.append(token.text)
        
    else:
        lemmas_for_synset = {lemma for s in synsets for lemma in s.lemma_names()}
        enriched_sentence.append('({})'.format('|'.join(lemmas_for_synset)))
        
print(' '.join(enriched_sentence))

I (need|want|require) to (take_out|draw_off|draw|withdraw) 5,000 rupees


## Functionize above and apply

Below I wrap the above code logic into two functions: one that outputs the domains identified for a word, and a second that builds the enhanced sentence.

More generally, it is good programming practice to *functionize* anything that will need to be invoked multiple times.

Let us examine which domains the word 'course' occurs in. It is best to input the plural form of the word.

Behold.


In [15]:
# func 1 to yield possible domains
def yield_poss_domains(focal_token):
    
    # assumes the WordnetAnnotator pipe was already added to nlp (see In [9])
    token = nlp(focal_token)[0]  # first token of the parsed string
    
    poss_domain_list = token._.wordnet.wordnet_domains()
    return poss_domain_list

In [16]:
# test-drive func above
domain_list = yield_poss_domains('courses')
print(domain_list)    


['social', 'school', 'pedagogy', 'tennis', 'university', 'golf', 'post', 'theology', 'philosophy', 'social', 'time_period', 'occultism', 'aviation', 'psychology', 'psychological_features', 'psychoanalysis', 'astronautics', 'factotum', 'astronomy', 'architecture', 'electrotechnology', 'fencing', 'mechanics', 'rowing', 'meteorology', 'physics', 'metrology', 'betting', 'skiing', 'nautical', 'engineering', 'skating', 'racing', 'geometry', 'astrology', 'baseball', 'football', 'topography', 'tourism', 'drawing', 'telegraphy', 'time_period', 'tv', 'pure_science', 'aviation', 'electricity', 'bowling', 'vehicles', 'transport', 'atomic_physic', 'optics', 'archaeology', 'quality', 'electronics', 'soccer', 'sport', 'military', 'telecommunication', 'oceanography', 'table_tennis', 'golf', 'radio', 'earth', 'cycling', 'artisanship', 'university', 'school', 'pedagogy', 'gastronomy', 'buildings', 'architecture', 'racing', 'sport', 'play', 'golf', 'vehicles']


In [17]:
# func 2 to take in list of domains & output enhanced sentence
def domain_2_enhSent(domains_list, sentence):    
    
    token_with_synsets = [(token, token._.wordnet.wordnet_synsets_for_domain(domains_list)) for token in sentence]
    enriched_sentence = []  # empty list to populate
    
    for token, synsets in token_with_synsets:
        
        if not synsets:
            enriched_sentence.append(token.text)
            
        else:
            lemmas_for_synset = {lemma for s in synsets for lemma in s.lemma_names()}
            enriched_sentence.append('({})'.format('|'.join(lemmas_for_synset)))
            
    enhSent = ' '.join(enriched_sentence)
    return enhSent

In [18]:
# test-drive above
domains_list = ['school', 'pedagogy', 'university']
sent0 = nlp('The MLBM course taught at ISB is broad-based and big-picture in scope.')
enh_sent = domain_2_enhSent(domains_list, sent0)
print(enh_sent)    


The MLBM (course_of_instruction|grade|form|course_of_study|course|class) (learn|teach|instruct) at ISB (be|make_up|comprise|represent|constitute) broad - based and (big|large) - (painting|film|image|motion_picture|motion-picture_show|icon|picture_show|flick|ikon|moving_picture|pic|picture|movie|moving-picture_show) in (reach|scope|ambit|compass|range|orbit) .
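
Because the lemma names are collected in a Python set, the order of the alternatives inside each pair of parentheses can change from one run to the next. If reproducible output matters, one possible variant sorts the names before joining them. The sketch below is hypothetical: `enrich_tokens` is not part of spacy-wordnet, and it takes ready-made `(text, synsets)` pairs so that it depends only on `lemma_names()`.

```python
def enrich_tokens(token_synset_pairs):
    """Build an enriched sentence with alternatives in sorted order.

    token_synset_pairs: iterable of (token_text, synsets), where each synset
    exposes lemma_names() as NLTK WordNet synsets do.
    """
    pieces = []
    for text, synsets in token_synset_pairs:
        names = sorted({name for s in synsets for name in s.lemma_names()})
        # keep the original text when no synset matched the chosen domains
        pieces.append("({})".format("|".join(names)) if names else text)
    return " ".join(pieces)


# Usage (assumes nlp, sent0 and domains_list from the cells above):
#   pairs = [(t.text, t._.wordnet.wordnet_synsets_for_domain(domains_list)) for t in sent0]
#   print(enrich_tokens(pairs))
```

Sorting costs little here and makes the enhanced sentences easier to compare across runs.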


That's it from me for now. Back to the Slides.

Voleti.
