For citation information, please see the "Source Information" section listed in the associated README file: https://github.com/stephbuon/digital-history/tree/master/hist3368-week4-wordnet-controlled-vocab/

# Week 4: Mini WordNet Tutorial

# Expanded Controlled Vocabulary with WordNet

The 'get_synsets' method on a TextBlob 'Word' opens up WordNet's combined thesaurus and dictionary.  We won't go into the full power of "synsets" (sets of synonyms that share a single meaning), but suffice it to say that WordNet knows that "house," when used as a noun, can mean a "firm," a "sign of the zodiac," a "family," or a "theater."

In [2]:
from textblob import Word
from textblob.wordnet import NOUN
from nltk.corpus import wordnet as wn  # used in the cells below

w1 = Word("house")
# w1.synsets would return every sense; get_synsets lets us restrict to nouns.
syns = w1.get_synsets(pos=NOUN)
print(syns)


[Synset('house.n.01'), Synset('firm.n.01'), Synset('house.n.03'), Synset('house.n.04'), Synset('house.n.05'), Synset('house.n.06'), Synset('house.n.07'), Synset('sign_of_the_zodiac.n.01'), Synset('house.n.09'), Synset('family.n.01'), Synset('theater.n.01'), Synset('house.n.12')]
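
Synset names such as 'house.n.05' are hard to interpret on their own. As a minimal sketch, the helper below pairs each synset's name with its dictionary gloss. The name `describe_senses` is made up for this tutorial and is not part of WordNet; it assumes only that each item offers the `name()` and `definition()` methods that NLTK synsets provide.

```python
def describe_senses(synsets):
    """Return (name, definition) pairs for an iterable of WordNet synsets.

    Assumes each item offers the NLTK Synset methods name() and definition().
    """
    return [(s.name(), s.definition()) for s in synsets]
```

With the `syns` list from the cell above, looping over `describe_senses(syns)` and printing each pair gives one sense per line, which makes it easier to choose the sense you actually want.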


Likewise, WordNet knows that the word "building" can refer to different kinds of construction (as a noun), but it can also be a form of the verb "build," which itself has many different senses.

In [3]:
wn.synsets('building')

[Synset('building.n.01'),
 Synset('construction.n.01'),
 Synset('construction.n.07'),
 Synset('building.n.04'),
 Synset('construct.v.01'),
 Synset('build_up.v.02'),
 Synset('build.v.03'),
 Synset('build.v.04'),
 Synset('build.v.05'),
 Synset('build.v.06'),
 Synset('build.v.07'),
 Synset('build.v.08'),
 Synset('build_up.v.04'),
 Synset('build.v.10')]

A *hyponym* is a word that is a more specific version of another word.  So if we want to know the many different types of houses in the dictionary, we can call WordNet's .hyponyms() method on the synset for "house" and generate another controlled vocabulary from the results.

In [4]:
synlist = wn.synset('house.n.01').hyponyms()
synlist

[Synset('beach_house.n.01'),
 Synset('boarding_house.n.01'),
 Synset('bungalow.n.01'),
 Synset('cabin.n.02'),
 Synset('chalet.n.01'),
 Synset('chapterhouse.n.02'),
 Synset('country_house.n.01'),
 Synset('detached_house.n.01'),
 Synset('dollhouse.n.01'),
 Synset('duplex_house.n.01'),
 Synset('farmhouse.n.01'),
 Synset('gatehouse.n.01'),
 Synset('guesthouse.n.01'),
 Synset('hacienda.n.02'),
 Synset('lodge.n.04'),
 Synset('lodging_house.n.01'),
 Synset('maisonette.n.02'),
 Synset('mansion.n.02'),
 Synset('ranch_house.n.01'),
 Synset('residence.n.02'),
 Synset('row_house.n.01'),
 Synset('safe_house.n.01'),
 Synset('saltbox.n.01'),
 Synset('sod_house.n.01'),
 Synset('solar_house.n.01'),
 Synset('tract_house.n.01'),
 Synset('villa.n.02')]

Each synset's 'lemmas()' method gives us access to the lemmas (the individual word forms) associated with that synset.  Let's use the 'append' function and the 'lemmas' function to create a vocabulary list stripped of the WordNet apparatus.

In [5]:
new_vocab = []

for syn in synlist:
    for lemma in syn.lemmas():
        new_vocab.append(str(lemma.name()))
        
print(new_vocab)

['beach_house', 'boarding_house', 'boardinghouse', 'bungalow', 'cottage', 'cabin', 'chalet', 'chapterhouse', 'fraternity_house', 'frat_house', 'country_house', 'detached_house', 'single_dwelling', 'dollhouse', "doll's_house", 'duplex_house', 'duplex', 'semidetached_house', 'farmhouse', 'gatehouse', 'guesthouse', 'hacienda', 'lodge', 'hunting_lodge', 'lodging_house', 'rooming_house', 'maisonette', 'maisonnette', 'mansion', 'mansion_house', 'manse', 'hall', 'residence', 'ranch_house', 'residence', 'row_house', 'town_house', 'safe_house', 'saltbox', 'sod_house', 'soddy', 'adobe_house', 'solar_house', 'tract_house', 'villa']
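
Notice that 'residence' appears twice in the list above, because two different synsets share that lemma. A minimal sketch of one way to drop such repeats while keeping the original order follows; `unique_in_order` is a hypothetical helper name, and the sample list is a short excerpt of the output above.

```python
def unique_in_order(terms):
    """Drop repeated terms while keeping the first occurrence of each."""
    # dict keys are unique and (in Python 3.7+) remember insertion order.
    return list(dict.fromkeys(terms))

sample = ['mansion', 'hall', 'residence', 'ranch_house', 'residence']
print(unique_in_order(sample))
```

Calling `unique_in_order(new_vocab)` would do the same for the full list. Using `set(new_vocab)` would also remove repeats, but it would not preserve the order in which WordNet returned the terms.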


Bear in mind: we don't have to stop here.  We can keep drilling down within each of these categories to get an even finer-grained list.  Before we do, it helps to see which lemmas belong to which synset.

In [6]:
for syn in synlist:
    print(syn.lemmas())

[Lemma('beach_house.n.01.beach_house')]
[Lemma('boarding_house.n.01.boarding_house'), Lemma('boarding_house.n.01.boardinghouse')]
[Lemma('bungalow.n.01.bungalow'), Lemma('bungalow.n.01.cottage')]
[Lemma('cabin.n.02.cabin')]
[Lemma('chalet.n.01.chalet')]
[Lemma('chapterhouse.n.02.chapterhouse'), Lemma('chapterhouse.n.02.fraternity_house'), Lemma('chapterhouse.n.02.frat_house')]
[Lemma('country_house.n.01.country_house')]
[Lemma('detached_house.n.01.detached_house'), Lemma('detached_house.n.01.single_dwelling')]
[Lemma('dollhouse.n.01.dollhouse'), Lemma('dollhouse.n.01.doll's_house')]
[Lemma('duplex_house.n.01.duplex_house'), Lemma('duplex_house.n.01.duplex'), Lemma('duplex_house.n.01.semidetached_house')]
[Lemma('farmhouse.n.01.farmhouse')]
[Lemma('gatehouse.n.01.gatehouse')]
[Lemma('guesthouse.n.01.guesthouse')]
[Lemma('hacienda.n.02.hacienda')]
[Lemma('lodge.n.04.lodge'), Lemma('lodge.n.04.hunting_lodge')]
[Lemma('lodging_house.n.01.lodging_house'), Lemma('lodging_house.n.01.rooming_h

In [7]:
finer_syns = []

# Collect the hyponyms of every synset in synlist (one level further down).
for syn in synlist:
    for h in syn.hyponyms():
        finer_syns.append(h)

print(finer_syns)

[Synset('bed_and_breakfast.n.01'), Synset('log_cabin.n.01'), Synset('chateau.n.01'), Synset('dacha.n.01'), Synset('shooting_lodge.n.01'), Synset('summer_house.n.01'), Synset('villa.n.03'), Synset('villa.n.04'), Synset('lodge.n.03'), Synset('flophouse.n.01'), Synset('manor.n.01'), Synset('palace.n.01'), Synset('stately_home.n.01'), Synset('court.n.09'), Synset('deanery.n.01'), Synset('manse.n.02'), Synset('palace.n.04'), Synset('parsonage.n.01'), Synset('religious_residence.n.01'), Synset('brownstone.n.02'), Synset('terraced_house.n.01')]
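
The cell above goes only one level below `synlist`. To follow the tree all the way down, one option is a small recursive walk, sketched here under the assumption that every item offers a `.hyponyms()` method as NLTK synsets do. The name `all_hyponyms` is made up for this tutorial. NLTK synsets also have a `closure()` method for this kind of traversal; the hand-written version is shown only to make the logic visible.

```python
def all_hyponyms(synset, seen=None):
    """Collect every synset below `synset`, at any depth, without repeats."""
    if seen is None:
        seen = []
    for hypo in synset.hyponyms():
        if hypo not in seen:
            seen.append(hypo)
            all_hyponyms(hypo, seen)  # keep drilling down from this child
    return seen
```

Called as `all_hyponyms(wn.synset('house.n.01'))`, this should return every synset in `synlist` along with those in `finer_syns` and anything deeper, each listed once.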


In [8]:
new_vocab_finer = []

# Pull the plain lemma names out of each finer-grained synset.
for syn in finer_syns:
    for lemma in syn.lemmas():
        new_vocab_finer.append(lemma.name())

new_vocab_finer

['bed_and_breakfast',
 'bed-and-breakfast',
 'log_cabin',
 'chateau',
 'dacha',
 'shooting_lodge',
 'shooting_box',
 'summer_house',
 'villa',
 'villa',
 'lodge',
 'flophouse',
 'dosshouse',
 'manor',
 'manor_house',
 'palace',
 'castle',
 'stately_home',
 'court',
 'deanery',
 'manse',
 'palace',
 'parsonage',
 'vicarage',
 'rectory',
 'religious_residence',
 'cloister',
 'brownstone',
 'terraced_house']

In [10]:
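clean_new_vocab_finer = []

# WordNet joins multiword lemmas with underscores (and sometimes hyphens);
# replace both with spaces so the terms match ordinary running text.
for vocab in new_vocab_finer:
    v = vocab.replace('-', ' ').replace('_', ' ')
    clean_new_vocab_finer.append(v)

clean_new_vocab_finer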

['bed and breakfast',
 'bed and breakfast',
 'log cabin',
 'chateau',
 'dacha',
 'shooting lodge',
 'shooting box',
 'summer house',
 'villa',
 'villa',
 'lodge',
 'flophouse',
 'dosshouse',
 'manor',
 'manor house',
 'palace',
 'castle',
 'stately home',
 'court',
 'deanery',
 'manse',
 'palace',
 'parsonage',
 'vicarage',
 'rectory',
 'religious residence',
 'cloister',
 'brownstone',
 'terraced house']
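
A cleaned list like this becomes a controlled vocabulary once it is applied to a text. The sketch below counts whole-word, case-insensitive matches for each term using Python's standard `re` module. The function name `count_vocab`, the sample sentence, and the short sample vocabulary are all made up for illustration.

```python
import re
from collections import Counter

def count_vocab(text, vocab):
    """Count whole-word, case-insensitive matches of each vocabulary term."""
    counts = Counter()
    for term in set(vocab):  # set() so duplicate terms are counted only once
        # \b keeps 'court' from matching inside 'courtesy'.
        pattern = r'\b' + re.escape(term) + r'\b'
        hits = re.findall(pattern, text, flags=re.IGNORECASE)
        if hits:
            counts[term] = len(hits)
    return counts

# A made-up sentence, used only to show the function at work.
sample_text = "The manor house stood near the old parsonage; the manor was quiet."
sample_vocab = ['manor', 'manor house', 'parsonage', 'villa', 'villa']
print(count_vocab(sample_text, sample_vocab))
```

Note that overlapping terms are counted independently: the "manor" inside "manor house" counts toward both entries, so decide whether that suits your analysis before summing the totals.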

# Exercise

Read the WordNet documentation.  Figure out how to find synonyms instead of hyponyms.  Then find all the synonyms for 'happy' that appear in Jane Austen's writing.