For citation information, please see the "Source Information" section listed in the associated README file: https://github.com/stephbuon/digital-history/tree/master/hist3368-week4-wordnet-controlled-vocab/

# Week 4: Mini Wordnet Tutorial

One of the ways to build a compelling case for studying cultural difference and historical change is to examine a "controlled vocabulary" -- or a list of words that are semantically related.

We can use a dictionary or a thesaurus to generate a list of words that are related.  For instance, searching for the following words would give us a good set of vocabulary for studying the discourse of nonsense:

    buncombe
    horsefeathers
    applesauce
    hooey
    phooey
    fiddle-faddle
    
If we searched the parliamentary debates for the debates where speakers use these words, we have a good chance of finding the debates where parliamentarians are expressing a lack of respect for each other. We can go from that set of debates to a study of the most contentious topics in parliament.

We can use the **WordNet** package like a thesaurus to look up synonyms for words and to generate a list of semantically related words.  

We'll learn the following commands:
   * **Word()** -- wrap a word in a TextBlob object so that we can look it up in WordNet.
   * **.synsets** or **.get_synsets()** -- which look up files of semantically related words in which a given word appears. Once I have the name of these files, I can navigate to the individual words that are synonyms for my word.
   * We navigate to the individual words inside the Synset files by using the commands **.lemmas()** and **.name().**
   * **.hyponyms()** tells WordNet to look up the particularized synsets that are hierarchically inferior to a given synset. For example, calling .hyponyms() on the synset for 'cutlery' would give me the synsets for 'fork,' 'spoon,' and 'knife.'
   
In this notebook, we'll learn to apply these commands to generate a rich list of semantically-related words that the analyst might use for searching text. We'll learn how to format the results for future text mining.  And we'll start thinking about which words are useful. 

In a future notebook, we'll learn how to search historical texts with these rich lists of words.  


## Navigating Wordnet's Synsets to find Synonyms

The key to Wordnet is its files of related words, called **synsets**.  Each synset is named with a headword, the part of speech -- typically 'n' for a noun or 'v' for a verb -- and a sense number, as in 'house.n.01'.

In [60]:
from textblob import Word
from nltk.corpus import wordnet as wn

from textblob.wordnet import NOUN

w1 = Word("house")
w1

'house'

Let's look up the files in which the word 'house' appears.

In [61]:
w1.synsets

[Synset('house.n.01'),
 Synset('firm.n.01'),
 Synset('house.n.03'),
 Synset('house.n.04'),
 Synset('house.n.05'),
 Synset('house.n.06'),
 Synset('house.n.07'),
 Synset('sign_of_the_zodiac.n.01'),
 Synset('house.n.09'),
 Synset('family.n.01'),
 Synset('theater.n.01'),
 Synset('house.n.12'),
 Synset('house.v.01'),
 Synset('house.v.02')]

Notice that Wordnet knows that the word "house" can refer to a "sign of the zodiac." It knows that some uses of "house" refer to dwellings, while others refer to business firms, families, or theaters.  In general, these usages are listed in order from the most prevalent to the least.

Likewise, wordnet knows that the word "building" can refer to different kinds of construction (as a noun), but it can also be a verb form used with many different senses.

In [62]:
wn.synsets('building')

[Synset('building.n.01'),
 Synset('construction.n.01'),
 Synset('construction.n.07'),
 Synset('building.n.04'),
 Synset('construct.v.01'),
 Synset('build_up.v.02'),
 Synset('build.v.03'),
 Synset('build.v.04'),
 Synset('build.v.05'),
 Synset('build.v.06'),
 Synset('build.v.07'),
 Synset('build.v.08'),
 Synset('build_up.v.04'),
 Synset('build.v.10')]

Notice that some of the synsets have an 'n', meaning that they are senses of the word as a noun, as in 'house' meaning "a shelter."

Others have a 'v', meaning that they are senses of the word as a verb, as in 'house' meaning 'to provide shelter.'
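
As a small illustration of how these names are put together, the sketch below splits a synset name into its headword, part of speech, and sense number using plain string handling. The helper name `parse_synset_name` is made up for this example and is not part of WordNet or TextBlob.

```python
def parse_synset_name(name):
    """Split a synset name such as 'house.n.01' into its three parts."""
    # split from the right so that the headword stays in one piece
    headword, pos, sense = name.rsplit('.', 2)
    return headword, pos, int(sense)

# names copied from the output above
names = ['house.n.01', 'sign_of_the_zodiac.n.01', 'house.v.02']

parsed = [parse_synset_name(n) for n in names]
for headword, pos, sense in parsed:
    print(headword, pos, sense)

# keep only the names whose part of speech is 'n' (noun)
noun_names = [n for n in names if parse_synset_name(n)[1] == 'n']
print(noun_names)
```

In practice we rarely need to parse the names by hand, since NLTK synset objects report their part of speech through `.pos()`, but seeing the three parts separately makes the naming scheme easier to read.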

We can use the "pos = " parameter of the command .get_synsets() to tell WordNet that we're only interested in the nouns.


In [63]:
mysynsets = w1.get_synsets(pos=NOUN)
print(mysynsets)

[Synset('house.n.01'), Synset('firm.n.01'), Synset('house.n.03'), Synset('house.n.04'), Synset('house.n.05'), Synset('house.n.06'), Synset('house.n.07'), Synset('sign_of_the_zodiac.n.01'), Synset('house.n.09'), Synset('family.n.01'), Synset('theater.n.01'), Synset('house.n.12')]


We can navigate a list of synsets like any other list, with square brackets.

In [64]:
mysynsets[0]

Synset('house.n.01')

In [65]:
mysynsets[1]

Synset('firm.n.01')

#### Getting the individual words from the synset

We navigate to the individual words inside the Synset files by using the commands **.lemmas()** and **.name().**

Wordnet's 'lemmas()' function gives us access to the lemmas -- the base word forms -- stored in any of these synsets. 

In [66]:
mylemmas=mysynsets[1].lemmas()
mylemmas

[Lemma('firm.n.01.firm'),
 Lemma('firm.n.01.house'),
 Lemma('firm.n.01.business_firm')]

This information tells us that the second synset for house -- 'firm.n.01' -- contains the words 'firm,' 'house,' and 'business firm,' each of which can be used as synonyms for each other.

We can navigate *mylemmas* like a list

In [67]:
mylemmas[0]

Lemma('firm.n.01.firm')

In [68]:
mylemmas[1]

Lemma('firm.n.01.house')

In [69]:
mylemmas[2]

Lemma('firm.n.01.business_firm')

In *.lemmas()*, the words are labeled according to the name of the synset ('firm.n.01') and the name of the lemma ('firm,' 'house', or 'business firm').  

If we just want to get to the root words, we use **.name()**

In [70]:
mylemmas[0].name()

'firm'

### Get all the lemmas from a synset

 Let's use the 'append' function and the 'lemmas' function to create a vocabulary list stripped of the Wordnet apparatus.  

In [71]:
new_vocab = [] # start with an empty list 

for syn in mysynsets: # move through each synset in my list of synsets
    for lemma in syn.lemmas(): # move through each lemma in the synset
        lemmaname = lemma.name() # get the root word attached to the lemma
        new_vocab.append(str(lemmaname)) # save that root word in the list new_vocab
        
print(new_vocab)

['house', 'firm', 'house', 'business_firm', 'house', 'house', 'house', 'house', 'house', 'sign_of_the_zodiac', 'star_sign', 'sign', 'mansion', 'house', 'planetary_house', 'house', 'family', 'household', 'house', 'home', 'menage', 'theater', 'theatre', 'house', 'house']


Notice that we have all the synonyms for "house" from the categories we saw above -- some of which refer to business firms, others of which refer to dwellings, signs of the zodiac, families, or theaters. Notice that there is repetition, because the root word 'house' is also stored as a lemma under the synset for the family "house" and the theater "house," etc.  

We could eliminate repetitions from this list and we would have a master controlled vocabulary for finding synonyms for the word 'house' in text.
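
One way to eliminate the repetitions, sketched here with plain Python, is to pass the list through `dict.fromkeys()`, which keeps the first occurrence of each word and preserves the original order. The list is copied from the output above so that the sketch runs on its own; the name `unique_vocab` is just an illustrative choice.

```python
# the list printed above, copied here so the example is self-contained
new_vocab = ['house', 'firm', 'house', 'business_firm', 'house', 'house',
             'house', 'house', 'house', 'sign_of_the_zodiac', 'star_sign',
             'sign', 'mansion', 'house', 'planetary_house', 'house', 'family',
             'household', 'house', 'home', 'menage', 'theater', 'theatre',
             'house', 'house']

# dictionary keys are unique and keep insertion order,
# so this drops repeats without shuffling the words
unique_vocab = list(dict.fromkeys(new_vocab))

print(unique_vocab)
print(len(new_vocab), 'words reduced to', len(unique_vocab))
```

Using `set(new_vocab)` would also remove repeats, but it does not keep the words in their original order, which can make the list harder to inspect.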

## Finding Finer Senses of Meaning: Looking Up Hyponyms 

One of the most exciting features of a wordnet is that we don't have to stop here, with our short list of synonyms.  We might be interested in words for particular *kinds* of houses -- for instance, cabins, or bungalows.  

Searching for this more complete list is more likely to give us good results in text mining. That way, if someone refers to a treehouse or a hut, we'll still know that they're talking about houses. In fact, we'll be able to examine our data for *how* people talk about their houses.  

A word that names a more specific kind of the thing named by a more general word is called a *hyponym*.  'Bungalow,' for example, is a hyponym of 'house.'

Wordnet allows us to detect the hyponyms for any given word.  Using our new tools for navigating synsets, we can keep drilling down within each of these categories to get an even finer-grain list.

#### The .hyponyms() function

If we want to know the many different types of houses in the dictionary, we can use wordnet's .hyponyms() command to navigate these lists, and we can generate another controlled vocabulary from them.

.hyponyms() is applied to one synset at a time. It takes no arguments.

In [72]:
mysynlist = wn.synset('house.n.01')
hyposynlist = mysynlist.hyponyms()
hyposynlist

[Synset('beach_house.n.01'),
 Synset('boarding_house.n.01'),
 Synset('bungalow.n.01'),
 Synset('cabin.n.02'),
 Synset('chalet.n.01'),
 Synset('chapterhouse.n.02'),
 Synset('country_house.n.01'),
 Synset('detached_house.n.01'),
 Synset('dollhouse.n.01'),
 Synset('duplex_house.n.01'),
 Synset('farmhouse.n.01'),
 Synset('gatehouse.n.01'),
 Synset('guesthouse.n.01'),
 Synset('hacienda.n.02'),
 Synset('lodge.n.04'),
 Synset('lodging_house.n.01'),
 Synset('maisonette.n.02'),
 Synset('mansion.n.02'),
 Synset('ranch_house.n.01'),
 Synset('residence.n.02'),
 Synset('row_house.n.01'),
 Synset('safe_house.n.01'),
 Synset('saltbox.n.01'),
 Synset('sod_house.n.01'),
 Synset('solar_house.n.01'),
 Synset('tract_house.n.01'),
 Synset('villa.n.02')]

Let's use .lemmas() and .name() to call the words associated with these synsets.

In [73]:
for syn in hyposynlist:
    for l in syn.lemmas():
        print(l.name())

beach_house
boarding_house
boardinghouse
bungalow
cottage
cabin
chalet
chapterhouse
fraternity_house
frat_house
country_house
detached_house
single_dwelling
dollhouse
doll's_house
duplex_house
duplex
semidetached_house
farmhouse
gatehouse
guesthouse
hacienda
lodge
hunting_lodge
lodging_house
rooming_house
maisonette
maisonnette
mansion
mansion_house
manse
hall
residence
ranch_house
residence
row_house
town_house
safe_house
saltbox
sod_house
soddy
adobe_house
solar_house
tract_house
villa


Notice that we've gotten a lot of data -- and this was just ONE of the dozen different synsets for  the word 'house.' This is okay, because we probably don't want to know the synsets for 'house' as in 'business firm' or 'sign of the zodiac' if we're trying to understand dwellings.


Let's practice calling up the lemmas for a synset again.

In [74]:
for synset in Word('park').synsets:
    for lemma in synset.lemmas():
        print(lemma.name())

park
parkland
park
commons
common
green
ballpark
park
Park
Mungo_Park
parking_lot
car_park
park
parking_area
park
park
park


In [75]:
for synset in Word('crime').synsets:
    for lemma in synset.lemmas():
        print(lemma.name())

crime
offense
criminal_offense
criminal_offence
offence
law-breaking
crime


In [76]:
for synset in Word('girl').synsets:
    for lemma in synset.lemmas():
        print(lemma.name())

girl
miss
missy
young_lady
young_woman
fille
female_child
girl
little_girl
daughter
girl
girlfriend
girl
lady_friend
girl


Let's practice calling up all the lemmas for the hyponyms of the same words.

In [77]:
for synset in Word('park').synsets:
    for s in synset.hyponyms():
        for lemma in s.lemmas():
            print(lemma.name())

national_park
safari_park
amusement_park
funfair
pleasure_ground
village_green
used-car_lot
angle-park
double-park
parallel-park


In [78]:
for synset in Word('crime').synsets:
    for s in synset.hyponyms():
        for lemma in s.lemmas():
            print(lemma.name())

attack
attempt
barratry
capital_offense
cybercrime
felony
forgery
fraud
Had_crime
hijack
highjack
mayhem
misdemeanor
misdemeanour
infraction
violation
infringement
perpetration
commission
committal
statutory_offense
statutory_offence
regulatory_offense
regulatory_offence
Tazir_crime
thuggery
treason
high_treason
lese_majesty
vice_crime
victimless_crime
war_crime


In [79]:
for synset in Word('girl').synsets:
    for s in synset.hyponyms():
        for lemma in s.lemmas():
            print(lemma.name())

baby
babe
sister
belle
bimbo
chachka
tsatske
tshatshke
tchotchke
tchotchkeleh
chit
colleen
dame
doll
wench
skirt
chick
bird
flapper
gal
gamine
Gibson_girl
lass
lassie
young_girl
jeune_fille
maid
maiden
May_queen
queen_of_the_May
mill-girl
party_girl
peri
ring_girl
rosebud
sex_kitten
sexpot
sex_bomb
shop_girl
soubrette
sweater_girl
tomboy
romp
hoyden
valley_girl
working_girl
Campfire_Girl
farm_girl
flower_girl
moppet
schoolgirl
Scout
mother's_daughter


### Drilling down into even more particular hyponyms

Hyponyms don't just go one level deep.  Each of the individual synsets in the list **hyposynlist** may have hyponyms into which it can be particularized.  

Let's look at the synset  Synset('country_house.n.01'), which is hyposynlist[6].  Does it have any hyponyms underneath it?  How many kinds of country houses are there, according to Wordnet?

In [80]:
finerhyposynlist = hyposynlist[6].hyponyms()
finerhyposynlist

[Synset('chateau.n.01'),
 Synset('dacha.n.01'),
 Synset('shooting_lodge.n.01'),
 Synset('summer_house.n.01'),
 Synset('villa.n.03'),
 Synset('villa.n.04')]

Let's get all the words from both lists.

Wordnet knows that 'Country Houses' contains chateaus, dachas, shooting lodges, summer houses, and villas.  

Let's get all the subcategories of different words for dwelling -- by moving through all the words in hyposynlist.

In [81]:
finer_syns = [] # create an empty list
 
for syn in hyposynlist: # loop through all the synsets
    myhyponyms = syn.hyponyms()
    for h in myhyponyms:
        finer_syns.append(h)
  
print(finer_syns)

[Synset('bed_and_breakfast.n.01'), Synset('log_cabin.n.01'), Synset('chateau.n.01'), Synset('dacha.n.01'), Synset('shooting_lodge.n.01'), Synset('summer_house.n.01'), Synset('villa.n.03'), Synset('villa.n.04'), Synset('lodge.n.03'), Synset('flophouse.n.01'), Synset('manor.n.01'), Synset('palace.n.01'), Synset('stately_home.n.01'), Synset('court.n.09'), Synset('deanery.n.01'), Synset('manse.n.02'), Synset('palace.n.04'), Synset('parsonage.n.01'), Synset('religious_residence.n.01'), Synset('brownstone.n.02'), Synset('terraced_house.n.01')]


Wordnet knows that there are bed and breakfasts, log cabins, palaces, stately homes, parsonages, and deaneries, to name a few.

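Let's practice some more. The loops below go one level further: for each hyponym, they check whether it has hyponyms of its own. Note that the innermost loop reads lemmas from `s`, not `syn`, so what gets printed is the set of first-level hyponyms that can be particularized further; swapping in `syn.lemmas()` would print the second-level words instead.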

In [82]:
mywords = []

for synset in Word('girl').synsets:
    for s in synset.hyponyms():
        for syn in s.hyponyms():
            for lemma in s.lemmas():
                if lemma.name() not in mywords:
                    mywords.append(lemma.name())
                    print(lemma.name())

lass
lassie
young_girl
jeune_fille
maid
maiden
Scout


In [83]:
mywords = []

for synset in Word('crime').synsets:
    for s in synset.hyponyms():
        for syn in s.hyponyms():
            for lemma in s.lemmas():
                if lemma.name() not in mywords:
                    mywords.append(lemma.name())
                    print(lemma.name())

attack
attempt
felony
fraud
hijack
highjack
misdemeanor
misdemeanour
infraction
violation
infringement
statutory_offense
statutory_offence
regulatory_offense
regulatory_offence
vice_crime


In [84]:
mywords = []

for synset in Word('garden').synsets:
    for s in synset.hyponyms():
        for syn in s.hyponyms():
            for lemma in s.lemmas():
                if lemma.name() not in mywords:
                    mywords.append(lemma.name())
                    print(lemma.name())

flower_garden
grove
woodlet
orchard
plantation
kitchen_garden
vegetable_garden
vegetable_patch


### Keep Drilling

Returning just to the list of country houses, let's see how far we can take 'houses' into particulars.

In [85]:
finerhyposynlist

[Synset('chateau.n.01'),
 Synset('dacha.n.01'),
 Synset('shooting_lodge.n.01'),
 Synset('summer_house.n.01'),
 Synset('villa.n.03'),
 Synset('villa.n.04')]


Let's look at all the words by calling the .lemmas() and .name() for each synset we generated.

In [86]:
for syn in finerhyposynlist:
    for l in syn.lemmas():
        print(l.name())

chateau
dacha
shooting_lodge
shooting_box
summer_house
villa
villa


Next, we can use .hyponyms() again to ask if any of the synsets we generated last time have smaller particular divisions of language.

Are there particular kinds of villas, shooting lodges, etc?

In [87]:
villa = finerhyposynlist[4]
villa

Synset('villa.n.03')

In [88]:
villa = finerhyposynlist[4]
finerfinerhyposynlist = villa.hyponyms()
finerfinerhyposynlist

[]

In [89]:
villa = finerhyposynlist[5]
finerfinerhyposynlist = villa.hyponyms()
finerfinerhyposynlist

[]

In [90]:
shootinglodge = finerhyposynlist[2]
finerfinerhyposynlist = shootinglodge.hyponyms()
finerfinerhyposynlist

[]

No.  There are no subdivisions of 'villa' or 'shooting lodge' on record. However, we can look at the synonyms listed for shooting lodge.

In [91]:
shootinglodgewords = [] # start with an empty list 

for lemma in shootinglodge.lemmas(): # move through each lemma in the synset
    lemmaname = lemma.name() # get the root word attached to the lemma
    shootinglodgewords.append(str(lemmaname)) # save that root word in the list shootinglodgewords
        
print(shootinglodgewords)

['shooting_lodge', 'shooting_box']


There are two phrases here for referring to shooting lodges: shooting lodge and shooting box.

How far do you have to go to create a solid analysis?  That's a matter of judgment. However, you should know that you can keep using the command .hyponyms() to inspect more and more particular meanings of words.

### Using a For Loop to Get all the hyponym words for 'house' as dwelling

We now have lists with at least two levels of detail -- hyposynlist and finer_syns.  

Hyposynlist represents the subcategories of 'house' as dwelling.

Finer_syns represents the subcategories of each synset in hyposynlist.

Let's get all the words from both variables.

In [92]:
housinglemmas = [] # create an empty list

for syn in hyposynlist:
    for l in syn.lemmas():
        if l.name() not in housinglemmas: # check for uniqueness
            housinglemmas.append(l.name())

for syn in finer_syns:
    for l in syn.lemmas():
        if l.name() not in housinglemmas: # check for uniqueness
            housinglemmas.append(l.name())

housinglemmas

['beach_house',
 'boarding_house',
 'boardinghouse',
 'bungalow',
 'cottage',
 'cabin',
 'chalet',
 'chapterhouse',
 'fraternity_house',
 'frat_house',
 'country_house',
 'detached_house',
 'single_dwelling',
 'dollhouse',
 "doll's_house",
 'duplex_house',
 'duplex',
 'semidetached_house',
 'farmhouse',
 'gatehouse',
 'guesthouse',
 'hacienda',
 'lodge',
 'hunting_lodge',
 'lodging_house',
 'rooming_house',
 'maisonette',
 'maisonnette',
 'mansion',
 'mansion_house',
 'manse',
 'hall',
 'residence',
 'ranch_house',
 'row_house',
 'town_house',
 'safe_house',
 'saltbox',
 'sod_house',
 'soddy',
 'adobe_house',
 'solar_house',
 'tract_house',
 'villa',
 'bed_and_breakfast',
 'bed-and-breakfast',
 'log_cabin',
 'chateau',
 'dacha',
 'shooting_lodge',
 'shooting_box',
 'summer_house',
 'flophouse',
 'dosshouse',
 'manor',
 'manor_house',
 'palace',
 'castle',
 'stately_home',
 'court',
 'deanery',
 'parsonage',
 'vicarage',
 'rectory',
 'religious_residence',
 'cloister',
 'brownstone',
 '

Wow! So many words for housing! -- and we could keep drilling down if we wanted to.

As an analyst, you will often want to generate a rich list of synonyms and hyponyms like this one.  You will have to be the judge of when enough is enough. You will also have to navigate WordNet on your own to grab a rich sense of semantically connected words with precise meaning.



At a certain point, there are diminishing returns.

In [93]:
housinglemmas2 = [] # create an empty list

still_finer_syns = [] # create an empty list
 
for syn in finer_syns: # loop through all the synsets
    myhyponyms = syn.hyponyms()
    for h in myhyponyms:
        still_finer_syns.append(h)
              
for syn in still_finer_syns:
    for l in syn.lemmas():
        if l.name() not in housinglemmas2: # check for uniqueness
            housinglemmas2.append(l.name())

housinglemmas2

['alcazar', 'glebe_house', 'convent', 'monastery', 'priory']


## **Cleaning Up a  Controlled Vocabulary**
Controlled vocabularies have to be used with care -- and with critical thinking.  What's in the controlled vocabulary matters.  "Applesauce" is listed in the thesaurus as a synonym for nonsense, but it also means a sauce made from cooked apples.  The term is *ambiguous*.  We could search for it, but we'd want to treat the term with particular care -- otherwise we might just find debates about apples. Therefore we should eliminate "applesauce" from our controlled vocabulary before searching.

Returning to the words in the variable *housinglemmas* above -- what do you think of our results? Are all of them precise as words for describing housing? Or are some of them indeterminate in meaning?  

Here's what I see:

   * 'hall' and 'court' can have other meanings -- if we're using these words to mine for text, we probably want to be careful about those words. 

In [94]:
housinglemmas2 = []

for word in housinglemmas:
    if word not in ['hall', 'court']:
        housinglemmas2.append(word)

housinglemmas2[:15]

['beach_house',
 'boarding_house',
 'boardinghouse',
 'bungalow',
 'cottage',
 'cabin',
 'chalet',
 'chapterhouse',
 'fraternity_house',
 'frat_house',
 'country_house',
 'detached_house',
 'single_dwelling',
 'dollhouse',
 "doll's_house"]

Housinglemmas2 now has everything except 'hall' and 'court.' This is more appropriate for use with text mining, because we're more likely to get meaningful results with a more specific list.

As an analyst, you will have to make critical decisions about which of these words to use. 

## Cleaning the data for use with raw text

To prepare our list of words for matching with historic texts, we need to make sure that there aren't any characters that will interfere with the matching process.

Hyphens and underscores are a problem for us.  Boardinghouse may be written "boarding house" or "boardinghouse" in a historical text, but not "boarding_house."  What should we do?

The easiest way to deal with this problem is to apply a cleaning script to BOTH the list of controlled vocabulary AND the historical text where we remove hyphens and underscores, replacing them all with spaces.  Then we can search for the word "boardinghouse" and the bigram "boarding house" in the raw text, and we will get all occurrences, no matter how they were punctuated.  

#### Remove hyphens and underscores from both the historical text and the controlled vocabulary

The script below cleans the controlled vocabulary, replacing hyphens and underscores with spaces. The same two replacements should be applied to the historical text.

In [95]:
cleanwords = []

for word in housinglemmas2:
    v = word.replace('-', ' ')
    v = v.replace('_', ' ') 
    cleanwords.append(str(v))
    
cleanwords[:10]

['beach house',
 'boarding house',
 'boardinghouse',
 'bungalow',
 'cottage',
 'cabin',
 'chalet',
 'chapterhouse',
 'fraternity house',
 'frat house']
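
To show the same cleaning step applied to both sides of the match, here is a minimal sketch that wraps the two replacements in a function and runs it over a short vocabulary list and a sentence. The function name `normalize`, the sample vocabulary, and the sample sentence are all invented for illustration; they are not drawn from the historical corpus.

```python
def normalize(text):
    """Replace hyphens and underscores with spaces."""
    return text.replace('-', ' ').replace('_', ' ')

# a few entries in the style of the controlled vocabulary above
sample_vocab = ['boarding_house', 'boardinghouse', "doll's_house", 'bed-and-breakfast']
clean_vocab = [normalize(w) for w in sample_vocab]

# hypothetical sentence standing in for a historical text
sample_text = "She kept a boarding-house near the bed-and-breakfast."
clean_text = normalize(sample_text)

# crude substring check: which cleaned terms occur in the cleaned text?
found = [w for w in clean_vocab if w in clean_text]

print(clean_vocab)
print(clean_text)
print(found)
```

The substring check is only a rough stand-in for a real search. It would, for instance, find 'hall' inside 'shall,' so matching on whole words is the safer choice when searching an actual corpus.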

### Let's intensively look into words for crime

Let's explore what it looks like to really dive deep into Wordnet, using Wordnet to generate a list of words for 'crime.'

In [96]:
from nltk.corpus import wordnet as wn
from textblob import Word

controlled_vocab = []
        
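# collect lemmas from 'crime' and from its hyponyms, several levels down
# note: the `for syn` loop reads s.lemmas() rather than syn.lemmas(),
# so the words of the second-level synsets themselves are not added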
for synset in Word('crime').synsets:
    for lemma in synset.lemmas():
        if lemma.name() not in controlled_vocab:
            controlled_vocab.append(lemma.name())
    for s in synset.hyponyms():
        for lemma in s.lemmas():
            if lemma.name() not in controlled_vocab:
                controlled_vocab.append(lemma.name())
        for syn in s.hyponyms():
            for lemma in s.lemmas():
                if lemma.name() not in controlled_vocab:
                    controlled_vocab.append(lemma.name())        
            for ss in syn.hyponyms():
                for lemma in ss.lemmas():
                    if lemma.name() not in controlled_vocab:
                        controlled_vocab.append(lemma.name())
                for sss in ss.hyponyms():
                    for lemma in sss.lemmas():
                        if lemma.name() not in controlled_vocab:
                            controlled_vocab.append(lemma.name())
                    for ssss in sss.hyponyms():
                        for lemma in ssss.lemmas():
                            if lemma.name() not in controlled_vocab:
                                controlled_vocab.append(lemma.name())

     
controlled_vocab

['crime',
 'offense',
 'criminal_offense',
 'criminal_offence',
 'offence',
 'law-breaking',
 'attack',
 'attempt',
 'aggravated_assault',
 'battery',
 'assault_and_battery',
 'resisting_arrest',
 'mugging',
 'barratry',
 'capital_offense',
 'cybercrime',
 'felony',
 'commercial_bribery',
 'housebreaking',
 'break-in',
 'breaking_and_entering',
 'home_invasion',
 'abduction',
 'kidnapping',
 'snatch',
 'blackmail',
 'protection',
 'tribute',
 'shakedown',
 'biopiracy',
 'breach_of_trust_with_fraudulent_intent',
 'embezzlement',
 'peculation',
 'defalcation',
 'misapplication',
 'misappropriation',
 'plunderage',
 'raid',
 'grand_larceny',
 'grand_theft',
 'petit_larceny',
 'petty_larceny',
 'petty',
 'pilferage',
 'robbery',
 'armed_robbery',
 'heist',
 'holdup',
 'stickup',
 'caper',
 'job',
 'dacoity',
 'dakoity',
 'rip-off',
 'highjacking',
 'hijacking',
 'piracy',
 'buccaneering',
 'highway_robbery',
 'rolling',
 'rustling',
 'shoplifting',
 'shrinkage',
 'skimming',
 'forgery',
 '
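
The deeply nested loops above can also be written as one recursive helper that walks down the hyponym tree to a chosen depth. The sketch below is one possible way to do that, not the notebook's own method. So that it runs without WordNet installed, it is demonstrated on two tiny stand-in classes, `FakeSynset` and `FakeLemma`, which are invented here and only imitate the `.lemmas()`, `.hyponyms()`, and `.name()` calls used in this notebook. The toy tree is made up for the demonstration and is not WordNet's actual hierarchy.

```python
def collect_lemma_names(synset, max_depth=5):
    """Return unique lemma names for a synset and its hyponyms, depth-first."""
    names = []

    def walk(s, depth):
        # record every lemma of the current synset
        for lemma in s.lemmas():
            if lemma.name() not in names:
                names.append(lemma.name())
        # then descend into its hyponyms, unless we are deep enough
        if depth < max_depth:
            for child in s.hyponyms():
                walk(child, depth + 1)

    walk(synset, 0)
    return names


# --- hypothetical stand-ins that mimic the synset interface ---
class FakeLemma:
    def __init__(self, name):
        self._name = name

    def name(self):
        return self._name


class FakeSynset:
    def __init__(self, words, children=()):
        self._lemmas = [FakeLemma(w) for w in words]
        self._children = list(children)

    def lemmas(self):
        return self._lemmas

    def hyponyms(self):
        return self._children


# a toy tree, three levels deep
toy = FakeSynset(['crime', 'offense'], [
    FakeSynset(['felony'], [
        FakeSynset(['burglary'], [FakeSynset(['housebreaking'])]),
    ]),
    FakeSynset(['fraud']),
])

all_names = collect_lemma_names(toy)
shallow_names = collect_lemma_names(toy, max_depth=1)
print(all_names)
print(shallow_names)
```

With NLTK loaded, the same helper should accept a real synset, for example `collect_lemma_names(wn.synset('crime.n.01'))`. Because it reads the lemmas at every level it visits, its result on real data may differ from the list printed above. NLTK synsets also provide a `.closure()` method that can traverse a relation such as hyponymy without hand-written recursion.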

We didn't use the .hypernyms() command above, but it's useful. 

    .hypernyms()

does the opposite of .hyponyms() -- it grabs the larger category, like 'cutlery' for 'fork,' or 'utensil' for 'cutlery.'  So if we take some hypernyms of 'crime' and then add all of their hyponyms, we get a very wide sweep of words for 'offense' or 'law-breaking.'  

In [97]:
hyper = []
hypersyns = []

for synset in Word('crime').synsets:
    for lemma in synset.lemmas():
        if lemma.name() not in hyper:
            hyper.append(lemma.name())
    for s in synset.hypernyms():
        for lemma in s.lemmas():
            if lemma.name() not in hyper:
                hyper.append(lemma.name())
        for hh in s.hypernyms():
            hypersyns.append(hh)
            for lemma in s.lemmas():
                if lemma.name() not in hyper:
                    hyper.append(lemma.name())      
       
hyper

['crime',
 'offense',
 'criminal_offense',
 'criminal_offence',
 'offence',
 'law-breaking',
 'transgression',
 'evildoing']

In [98]:
for synset in hypersyns:
    for ss in synset.hyponyms():
        for lemma in ss.lemmas():
            if lemma.name() not in controlled_vocab:
                controlled_vocab.append(lemma.name())
        for sss in ss.hyponyms():
            for lemma in sss.lemmas():
                if lemma.name() not in controlled_vocab:
                    controlled_vocab.append(lemma.name())
            for ssss in sss.hyponyms():
                for lemma in ssss.lemmas():
                    if lemma.name() not in controlled_vocab:
                        controlled_vocab.append(lemma.name())
controlled_vocab

['crime',
 'offense',
 'criminal_offense',
 'criminal_offence',
 'offence',
 'law-breaking',
 'attack',
 'attempt',
 'aggravated_assault',
 'battery',
 'assault_and_battery',
 'resisting_arrest',
 'mugging',
 'barratry',
 'capital_offense',
 'cybercrime',
 'felony',
 'commercial_bribery',
 'housebreaking',
 'break-in',
 'breaking_and_entering',
 'home_invasion',
 'abduction',
 'kidnapping',
 'snatch',
 'blackmail',
 'protection',
 'tribute',
 'shakedown',
 'biopiracy',
 'breach_of_trust_with_fraudulent_intent',
 'embezzlement',
 'peculation',
 'defalcation',
 'misapplication',
 'misappropriation',
 'plunderage',
 'raid',
 'grand_larceny',
 'grand_theft',
 'petit_larceny',
 'petty_larceny',
 'petty',
 'pilferage',
 'robbery',
 'armed_robbery',
 'heist',
 'holdup',
 'stickup',
 'caper',
 'job',
 'dacoity',
 'dakoity',
 'rip-off',
 'highjacking',
 'hijacking',
 'piracy',
 'buccaneering',
 'highway_robbery',
 'rolling',
 'rustling',
 'shoplifting',
 'shrinkage',
 'skimming',
 'forgery',
 '

'Murder' is missing from this list of crimes, so let's add the hyponyms and near matches for 'murder.'

In [99]:
# get hyponyms of the hyponyms
for synset in Word('murder').synsets:
    for lemma in synset.lemmas():
        if lemma.name() not in controlled_vocab:
            controlled_vocab.append(lemma.name())
    for s in synset.hyponyms():
        for lemma in s.lemmas():
            if lemma.name() not in controlled_vocab:
                controlled_vocab.append(lemma.name())
        for syn in s.hyponyms():
            for lemma in s.lemmas():
                if lemma.name() not in controlled_vocab:
                    controlled_vocab.append(lemma.name())        
            for ss in syn.hyponyms():
                for lemma in ss.lemmas():
                    if lemma.name() not in controlled_vocab:
                        controlled_vocab.append(lemma.name())
                for sss in ss.hyponyms():
                    for lemma in sss.lemmas():
                        if lemma.name() not in controlled_vocab:
                            controlled_vocab.append(lemma.name())
                    for ssss in sss.hyponyms():
                        for lemma in ssss.lemmas():
                            if lemma.name() not in controlled_vocab:
                                controlled_vocab.append(lemma.name())


for synset in Word('murder').synsets:
    for lemma in synset.lemmas():
        if lemma.name() not in controlled_vocab:
            controlled_vocab.append(lemma.name())
    for s in synset.hypernyms():
        for lemma in s.lemmas():
            if lemma.name() not in controlled_vocab:
                controlled_vocab.append(lemma.name())
            
controlled_vocab[:10]

['crime',
 'offense',
 'criminal_offense',
 'criminal_offence',
 'offence',
 'law-breaking',
 'attack',
 'attempt',
 'aggravated_assault',
 'battery']

How many words is that?

In [100]:
len(controlled_vocab)

335

In [101]:
clean_controlled_vocab = []

for word in controlled_vocab:
    v = word.replace('-', ' ')
    v = v.replace('_', ' ') 
    clean_controlled_vocab.append(str(v))
    
clean_controlled_vocab[:10]

['crime',
 'offense',
 'criminal offense',
 'criminal offence',
 'offence',
 'law breaking',
 'attack',
 'attempt',
 'aggravated assault',
 'battery']

In [118]:
import pandas as pd

vocab = pd.DataFrame(data={"word": clean_controlled_vocab})

In [119]:
vocab.head()

               word
0             crime
1           offense
2  criminal offense
3  criminal offence
4           offence


Save a copy of the data for later.

In [120]:
cd ~/digital-history

/users/jguldi/digital-history


In [121]:
vocab.to_csv('crime_vocab.csv', header = True)
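
For readers who want to check that a saved list can be read back later, here is a small round-trip sketch. It uses Python's built-in `csv` module rather than pandas, and it writes to a temporary folder under a made-up file name, so it does not touch the `crime_vocab.csv` file saved above.

```python
import csv
import os
import tempfile

# the first few cleaned words from the list above
words = ['crime', 'offense', 'criminal offense']

# hypothetical file name inside a throwaway temporary folder
path = os.path.join(tempfile.mkdtemp(), 'crime_vocab_sketch.csv')

# write one word per row under a 'word' header
with open(path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['word'])
    writer.writerows([w] for w in words)

# read the file back into a plain list
with open(path, newline='', encoding='utf-8') as f:
    reloaded = [row['word'] for row in csv.DictReader(f)]

print(reloaded)
```

The file written by pandas above also contains an index column, because `to_csv` writes the index unless it is called with `index=False`; a reader such as `csv.DictReader` would see that column under an empty header.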

## Assignment

1) Read the wordnet documentation.  How do you use wordnet to find antonyms? Call up some antonyms for common words.

2) Find all the words for 'happy' in Jane Austen.

3) Think about ambiguity in the list of words for crime.  Decide on a list of words to use as stopwords for that list -- words that have too many meanings for them to be useful.  Stopword the crime vocab list. Re-save it.

Upload a screenshot of your code and results to Canvas.

Read the WordNet documentation.  Figure out how to find synonyms instead of hyponyms.  Find all the synonyms for 'happy' in Jane Austen.