For citation information, please see the "Source Information" section listed in the associated README file: https://github.com/stephbuon/digital-history/tree/master/hist3368-week4-wordnet-controlled-vocab/

# Week 4 Assignment: Working With a Controlled Vocabulary

Inspired by tutorials by Paige McKenzie - https://p-mckenzie.github.io/2018/01/11/Jane-Austen/
William Scott - https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089

For the rest of this notebook, we'll be working with a 'controlled vocabulary,' which is to say, a list of expert-defined words that limits our word counts to words sharing a certain semantic valence.  Controlled vocabularies have been used in digital history to examine the words Victorian people used to describe the way that strangers walked down the street, and to show that novelists in the nineteenth century described the urban landscape with increasing detail.  

First, we'll download some novels by Jane Austen to try our vocabulary on.  Then, we'll talk about how to clean the text using stemming and lemmatization.  

Next, we'll use a controlled vocabulary to limit the count to words that are interesting to us.  Then, we'll expand that controlled vocabulary using the 'hyponym' feature of the WordNet package, which consults a dictionary of the English language organized by linguists at Princeton.  

Finally, we'll visualize our findings.


## Download some Jane Austen Novels

In [107]:
import nltk, numpy, re, matplotlib# , num2words

In [108]:
cd /scratch/group/history/hist_3368-jguldi

/scratch/group/history/hist_3368-jguldi


In [109]:
# load the novels from text files, trimming the front and back matter

with open('senseandsensibility.txt', 'r') as myfile:
    sas_data = myfile.read().split('\n\n"I suppose you know, ma\'am, that Mr. Ferrars is married"\n\nIt _was_ Edward\n\n"Everything in such respectable condition"\n\n ')[1].split('THE END')[0].strip()

with open('emma.txt', 'r') as myfile:
    emma_data = myfile.read().split('CHAPTER I')[1].split('FINIS')[0].strip()

with open('prideandprejudice.txt', 'r') as myfile:
    pap_data = myfile.read().split('CHAPTER I')[1].split('End of the Project Gutenberg EBook of Pride and Prejudice, by Jane Austen')[0].strip()


Make sure that your data matches what you think it should.

In [110]:
# printing only first 2000 characters.
sas_data[:2000]

"*       *       *       *       *\n\n\n\n\nCHAPTER I\n\n\nThe family of Dashwood had long been settled in Sussex. Their estate\nwas large, and their residence was at Norland Park, in the centre of\ntheir property, where, for many generations, they had lived in so\nrespectable a manner as to engage the general good opinion of their\nsurrounding acquaintance. The late owner of this estate was a single\nman, who lived to a very advanced age, and who for many years of his\nlife, had a constant companion and housekeeper in his sister. But her\ndeath, which happened ten years before his own, produced a great\nalteration in his home; for to supply her loss, he invited and\nreceived into his house the family of his nephew Mr. Henry Dashwood,\nthe legal inheritor of the Norland estate, and the person to whom he\nintended to bequeath it. In the society of his nephew and niece, and\ntheir children, the old Gentleman's days were comfortably spent. His\nattachment to them all increased. The consta

Looks good!

Isn't it getting tired, retyping the same command for each novel? Let's throw them all into one data set so we can loop through them.


In [111]:

data = [sas_data, emma_data, pap_data]
data[0][:2000]

"*       *       *       *       *\n\n\n\n\nCHAPTER I\n\n\nThe family of Dashwood had long been settled in Sussex. Their estate\nwas large, and their residence was at Norland Park, in the centre of\ntheir property, where, for many generations, they had lived in so\nrespectable a manner as to engage the general good opinion of their\nsurrounding acquaintance. The late owner of this estate was a single\nman, who lived to a very advanced age, and who for many years of his\nlife, had a constant companion and housekeeper in his sister. But her\ndeath, which happened ten years before his own, produced a great\nalteration in his home; for to supply her loss, he invited and\nreceived into his house the family of his nephew Mr. Henry Dashwood,\nthe legal inheritor of the Norland estate, and the person to whom he\nintended to bequeath it. In the society of his nephew and niece, and\ntheir children, the old Gentleman's days were comfortably spent. His\nattachment to them all increased. The consta

The text is still full of "\n" characters, which mark the line breaks in the original file.  We'll replace those with spaces in a second.

In [112]:
for i in range(len(data)):
    data[i] = data[i].replace('\n', ' ')
data[0][:2000]

"*       *       *       *       *     CHAPTER I   The family of Dashwood had long been settled in Sussex. Their estate was large, and their residence was at Norland Park, in the centre of their property, where, for many generations, they had lived in so respectable a manner as to engage the general good opinion of their surrounding acquaintance. The late owner of this estate was a single man, who lived to a very advanced age, and who for many years of his life, had a constant companion and housekeeper in his sister. But her death, which happened ten years before his own, produced a great alteration in his home; for to supply her loss, he invited and received into his house the family of his nephew Mr. Henry Dashwood, the legal inheritor of the Norland estate, and the person to whom he intended to bequeath it. In the society of his nephew and niece, and their children, the old Gentleman's days were comfortably spent. His attachment to them all increased. The constant attention of Mr. a

Next, let's split the text into words and print the first twenty words of each novel.

In [113]:
import pandas

for novel in data:
    words = novel.split()
    print(words[:20]) 


['*', '*', '*', '*', '*', 'CHAPTER', 'I', 'The', 'family', 'of', 'Dashwood', 'had', 'long', 'been', 'settled', 'in', 'Sussex.', 'Their', 'estate', 'was']
['Emma', 'Woodhouse,', 'handsome,', 'clever,', 'and', 'rich,', 'with', 'a', 'comfortable', 'home', 'and', 'happy', 'disposition,', 'seemed', 'to', 'unite', 'some', 'of', 'the', 'best']
['.', 'It', 'is', 'a', 'truth', 'universally', 'acknowledged,', 'that', 'a', 'single', 'man', 'in', 'possession', 'of', 'a', 'good', 'fortune,', 'must', 'be', 'in']
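
Calling .split() with no arguments splits on any run of whitespace, which is why the extra spaces left behind by the line breaks don't show up as empty "words" in the lists above. Here is a small sketch of the difference, using an invented sample string rather than the novels themselves:

```python
# Invented sample: three spaces after "I", like the gaps left by the line breaks.
sample = "CHAPTER I   The family"

# With no argument, split() treats any run of whitespace as one separator.
by_whitespace = sample.split()

# With an explicit ' ', every single space is a separator, so empty strings appear.
by_single_space = sample.split(' ')

print(by_whitespace)
print(by_single_space)
```

If you ever see empty strings in a word list, an explicit separator passed to .split() is a likely culprit.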


## Cleaning the Novels

Now, let's lowercase the text and get rid of punctuation.

In [114]:
import re

for i in range(len(data)):
    # data[i] is the current novel
    data[i] = data[i].lower() # force to lowercase
    data[i] = re.sub(r'[",.;:?()\[\]_*]', '', data[i]) # remove punctuation and special characters with a regular expression (raw string avoids invalid-escape warnings)
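
To see exactly what this cleaning step keeps and removes, it can help to try it on a single sentence first. The sketch below uses an invented sample sentence and the same set of characters as the cell above, written as a raw string:

```python
import re

# Invented sample sentence standing in for one of the novels.
sample = '"It _was_ Edward," she said; (and the bed-rooms were ready!)'

lowered = sample.lower()  # force to lowercase

# Every character listed inside the square brackets is deleted.
cleaned = re.sub(r'[",.;:?()\[\]_*]', '', lowered)

print(cleaned)
```

Notice that apostrophes, hyphens, and exclamation marks are not in the list, so they stay in the text. Whether that matters depends on what you plan to count.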


### Stemming

Stemming is the process of removing suffixes, like "ed" or "ing".

We will use another standard NLTK package, PorterStemmer, to do the stemming.



In [115]:
from nltk.stem import PorterStemmer

st = PorterStemmer()

stemmed_list = []

for novel in data:
    words = novel.split()
    for word in words:
        stemmed = st.stem(word)
        stemmed_list.append(stemmed)
        
stemmed_list[:20] # i have changed this so you print just the first words
# printing all the words is actually way more computer intensive than it may seem

['chapter',
 'i',
 'the',
 'famili',
 'of',
 'dashwood',
 'had',
 'long',
 'been',
 'settl',
 'in',
 'sussex',
 'their',
 'estat',
 'wa',
 'larg',
 'and',
 'their',
 'resid',
 'wa']

As we can see, stemming simply chops off endings, and the results are not always real words: "family" becomes "famili," "estate" becomes "estat," and "was" becomes "wa."  In the same way, "universally" becomes "univers" (which means that it will be counted with "universal" and "universe") and "single" becomes "singl" (which means it would be counted with "singled").

### Lemmatization

Let's pick up another term -- lemmatization -- which is extremely memory intensive, but far more accurate.  

In [116]:
from nltk.corpus import wordnet as wn

wn.morphy('aardwolves')

'aardwolf'

In [117]:
lemma_list = []

for novel in data:
    words = novel.split()
    for word in words:
        lemma = wn.morphy(word)
        if not lemma:
            # word is not a valid english word so skip it
            continue
        lemma_list.append(lemma)

lemma_list[:20]

['chapter',
 'i',
 'family',
 'have',
 'long',
 'be',
 'settle',
 'in',
 'sussex',
 'estate',
 'wa',
 'large',
 'residence',
 'wa',
 'at',
 'park',
 'in',
 'centre',
 'property',
 'many']

Lemmatization is often a more useful approach than stemming because it leverages an understanding of the word itself to convert the word back to its root word. "Acknowledged"  becomes "acknowledge," and "daughters" becomes "daughter."  

Note one important oddity: for words that WordNet doesn't list, such as "that," wn.morphy() returns None.  The loop above skips those words, so they drop out of lemma_list entirely; keep this in mind when you compare lemma counts with stem counts.  

Stemming and lemmatization are important because they matter for how we count.

In [118]:
from collections import Counter


count = Counter(stemmed_list)
print(count.most_common(100))

[('to', 4214), ('the', 4191), ('of', 3692), ('and', 3543), ('her', 2598), ('a', 2161), ('i', 2014), ('in', 1992), ('wa', 1896), ('it', 1896), ('she', 1629), ('be', 1501), ('that', 1403), ('for', 1282), ('not', 1281), ('as', 1247), ('you', 1239), ('he', 1125), ('hi', 1048), ('had', 1032), ('with', 1010), ('have', 920), ('but', 862), ('at', 848), ('is', 780), ('by', 761), ('mr', 758), ('on', 703), ('all', 674), ('so', 661), ('him', 649), ('my', 638), ('elinor', 614), ('which', 600), ('could', 588), ('no', 570), ('from', 554), ('would', 527), ('veri', 525), ('they', 524), ('their', 506), ('mariann', 486), ('them', 484), ('been', 454), ('were', 451), ('what', 443), ('thi', 442), ('me', 429), ('more', 414), ('ani', 409), ('your', 407), ('said', 393), ('everi', 388), ('will', 385), ('such', 373), ('than', 372), ('do', 368), ('or', 360), ('an', 347), ('one', 333), ('when', 317), ('must', 305), ('if', 303), ('much', 301), ('onli', 299), ('own', 284), ('know', 282), ('who', 276), ('time', 264),

# Counting Words and N-Grams

Let's see what the word counts look like now.

In [119]:
from collections import Counter

for novel in data:
    words = novel.split()
    count = Counter(words)
    print(count.most_common(10))

[('the', 4092), ('to', 4090), ('of', 3573), ('and', 3419), ('her', 2522), ('a', 2048), ('i', 1948), ('in', 1937), ('was', 1848), ('it', 1701)]
[('and', 107), ('to', 102), ('a', 92), ('of', 90), ('the', 81), ('her', 61), ('i', 49), ('you', 48), ('it', 45), ('in', 43)]
[('you', 31), ('of', 29), ('to', 22), ('a', 21), ('the', 18), ('and', 17), ('i', 17), ('that', 15), ('it', 14), ('is', 14)]


### N-Grams

Sometimes we want to look for multi-word phrases instead of individual words.  For example, if we're researching the living spaces of Jane Austen's England, we definitely want to know whether she refers to "dining rooms" or "bed-rooms" (which our punctuation clean-up might have turned into separate words, depending on what we did).

In [120]:
from nltk.stem import WordNetLemmatizer
from textblob import TextBlob

lemmatizer = WordNetLemmatizer()
three_grams_list = []

for novel in data:
    # Get the first 20 words of the novel.
    words = novel.split(maxsplit=20)
    
    # Delete the last entry of the list as it contains the rest of the novel's text.
    del words[-1]
    
    # Lemmatize
    lemmatized_words = []
    for word in words:
        lemmatized_words.append(lemmatizer.lemmatize(word))
    
    # Join the lemmatized words back into text.
    text = ' '.join(lemmatized_words)
    
    # Collect the n-grams and extend it to our list of n grams
    three_grams = TextBlob(text).ngrams(n=3)
    three_grams_list.extend(three_grams)

three_grams_list[:20]

[WordList(['chapter', 'i', 'the']),
 WordList(['i', 'the', 'family']),
 WordList(['the', 'family', 'of']),
 WordList(['family', 'of', 'dashwood']),
 WordList(['of', 'dashwood', 'had']),
 WordList(['dashwood', 'had', 'long']),
 WordList(['had', 'long', 'been']),
 WordList(['long', 'been', 'settled']),
 WordList(['been', 'settled', 'in']),
 WordList(['settled', 'in', 'sussex']),
 WordList(['in', 'sussex', 'their']),
 WordList(['sussex', 'their', 'estate']),
 WordList(['their', 'estate', 'wa']),
 WordList(['estate', 'wa', 'large']),
 WordList(['wa', 'large', 'and']),
 WordList(['large', 'and', 'their']),
 WordList(['and', 'their', 'residence']),
 WordList(['their', 'residence', 'wa']),
 WordList(['emma', 'woodhouse', 'handsome']),
 WordList(['woodhouse', 'handsome', 'clever'])]
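
An n-gram is just a window of n consecutive words sliding along the text. To see the idea without any libraries, here is a minimal plain-Python sketch. The function name ngrams is my own, and this is an illustration of the concept rather than TextBlob's actual implementation:

```python
def ngrams(words, n):
    """Return every run of n consecutive words as a tuple."""
    # zip lines up the list with copies of itself shifted by 1, 2, ... n-1 places,
    # and stops as soon as the shortest copy runs out.
    return list(zip(*(words[i:] for i in range(n))))

# Sample text taken from the cleaned opening of the first novel.
sample_words = "the family of dashwood had long been settled in sussex".split()

three_grams = ngrams(sample_words, 3)
two_grams = ngrams(sample_words, 2)

print(three_grams[:3])
print(len(three_grams))
```

A list of ten words yields eight 3-grams and nine 2-grams, because the window can't slide past the end of the list.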

In [121]:
for novel in data:
    bigrams = TextBlob(novel).ngrams(n=2)
    bigram_counter = Counter()
    for bigram in bigrams:
        # Join the bigram into a string as it is a WordList object.
        bigram_text = ' '.join(bigram)
        # Update the count.
        bigram_counter.update([bigram_text])

    print(bigram_counter.most_common(10))


[('to be', 436), ('of the', 431), ('in the', 359), ('it was', 281), ('of her', 277), ('to the', 242), ('mrs jennings', 237), ('to her', 231), ('i am', 224), ('she was', 209)]
[('” “', 29), ('miss taylor', 23), ('mr knightley', 13), ('of her', 12), ('mr weston', 12), ('of the', 10), ('to have', 9), ('it was', 9), ('her father', 9), ('she had', 9)]
[('my dear', 8), ('that he', 6), ('mr bennet', 6), ('you must', 5), ('of them', 5), ('it is', 4), ('do not', 4), ('how can', 4), ('will be', 4), ('of the', 3)]


Notice that the for loop outputs three lists of ten: the top ten bigrams for each novel.  Each list holds (bigram, count) pairs drawn from a Counter, which is a 'dictionary' type.  What if you wanted a simple list of bigrams instead?

In [122]:
bigrams_as_text = []

for novel in data:
    bigrams = TextBlob(novel).ngrams(n=2)
    for bigram in bigrams:
        bigram_text = ' '.join(bigram)
        bigrams_as_text.append(bigram_text)

bigrams_as_text[:20]

['chapter i',
 'i the',
 'the family',
 'family of',
 'of dashwood',
 'dashwood had',
 'had long',
 'long been',
 'been settled',
 'settled in',
 'in sussex',
 'sussex their',
 'their estate',
 'estate was',
 'was large',
 'large and',
 'and their',
 'their residence',
 'residence was',
 'was at']

Now that we have all the bigrams in one list, we can also count the overall top bigrams.

In [123]:
bigram_count = Counter(bigrams_as_text)
top_twenty_bigrams = bigram_count.most_common(20)
top_twenty_bigrams

[('to be', 448),
 ('of the', 444),
 ('in the', 369),
 ('of her', 291),
 ('it was', 290),
 ('to the', 244),
 ('mrs jennings', 237),
 ('to her', 234),
 ('i am', 234),
 ('she was', 213),
 ('of his', 209),
 ('i have', 203),
 ('it is', 194),
 ('she had', 193),
 ('could not', 167),
 ('on the', 161),
 ('have been', 161),
 ('in a', 161),
 ('and the', 160),
 ('at the', 160)]

What if we only want the bigrams that include the word "she"?

In [124]:
she_bigrams = []

for bigram in bigrams_as_text:
    if "she" in bigram: # substring test: True whenever the letters "she" appear anywhere in the bigram
        she_bigrams.append(bigram)
        
Counter(she_bigrams).most_common(5)

[('she was', 213),
 ('she had', 193),
 ('that she', 122),
 ('as she', 116),
 ('she could', 108)]

Unfortunately, the above code won't work for 'he,' because it will pick up other words that contain 'he.'

In [125]:
he_bigrams = []

for bigram in bigrams_as_text:
    if "he" in bigram: # substring test: also True for "the", "her", "she", and so on
        he_bigrams.append(bigram)
        
Counter(he_bigrams).most_common(5)

[('of the', 444),
 ('in the', 369),
 ('of her', 291),
 ('to the', 244),
 ('to her', 234)]
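
When used on strings, Python's "in" tests for substrings, not for whole words. A small sketch on an invented list of bigrams shows the difference between a substring test and a whole-word test:

```python
# Invented sample bigrams.
sample_bigrams = ["of the", "to her", "he was", "she had"]

# Substring test: "he" is found inside "the", "her", and "she" as well.
substring_hits = [b for b in sample_bigrams if "he" in b]

# Whole-word test: split the bigram into a list of words first,
# so that "in" compares "he" against each complete word.
word_hits = [b for b in sample_bigrams if "he" in b.split()]

print(substring_hits)
print(word_hits)
```

Splitting works for simple cases like this one. Regular expressions, introduced next, give finer control over what counts as a match.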

The solution is to use "regular expressions," which are ways of coding the details of language.  You can communicate to the computer about such needs as detecting the beginning or end of a word by using two backslashes (an "escape" to tell the computer not to take the next letter literally) and "b" for "boundary."  If you tell the computer to find a "boundary" in this way, it will look for both spaces and for the end of strings.

Notice how I use two "\\b"'s below to tell the computer to look for the word "he" but not "her" or "the." Python uses the 're' package to work with regular expressions, along with the .compile() and .match() commands.

In [126]:
import re
pattern = re.compile("\\bhe\\b") #  notice the .compile() and the "escapes"+b to signify "word boundary"

In [127]:
he_bigrams = []

for bigram in bigrams_as_text:
    if pattern.match(bigram): # notice the use of .match()
        he_bigrams.append(bigram)
        
Counter(he_bigrams).most_common(5)

[('he was', 126),
 ('he had', 113),
 ('he is', 75),
 ('he has', 49),
 ('he did', 37)]
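
One detail in the output above: every bigram begins with "he," and none ends with it (there's no "that he"). That's because .match() only tests whether the pattern fits at the beginning of the string. The re package also has .search(), which scans the whole string. A short sketch on invented bigrams:

```python
import re

# Raw string (r"...") so that \b reaches the regex engine as "word boundary".
pattern = re.compile(r"\bhe\b")

results = {}
for bigram in ["he was", "that he", "the house", "to her"]:
    at_start = bool(pattern.match(bigram))   # only checks the start of the string
    anywhere = bool(pattern.search(bigram))  # scans the whole string
    results[bigram] = (at_start, anywhere)
    print(bigram, at_start, anywhere)
```

Here "that he" is missed by .match() but found by .search(), while "the house" and "to her" are rejected by both because of the word boundaries.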

# Controlled Vocabulary

Let's look for what scholars call a "controlled vocabulary" -- a list of words that we know to be meaningful. For right now, let's pretend that we're researching the buildings, landscape, and furniture of nineteenth-century England.  I'm curious about what kinds of spaces are described in Austen, and I'd like to begin by counting them.

In [128]:
controlled_vocab = [
    "garden",
    "room", 
    "estate",
    "manor", 
    "hedge", 
    "residence",
    "park",
    "lane",
    "chair",
    "sofa",
    "settee",
    "bed",
    "bedroom",
    "chaise",
    "table",
    "rug",
    "carpet",
    "candelabra",
    "shed",
    "cottage",
    "fence",
    "turret",
    "castle",
    "palace",
    "hut",
    "dwelling"
]

In [129]:
controlled_words = []


words = data[0].split()

for w in words:
    if w in controlled_vocab:
        controlled_words.append(w)

Counter(controlled_words)

Counter({'estate': 19,
         'residence': 7,
         'park': 51,
         'dwelling': 6,
         'room': 97,
         'cottage': 56,
         'garden': 11,
         'shed': 3,
         'table': 23,
         'manor': 1,
         'chair': 9,
         'bed': 25,
         'lane': 3,
         'chaise': 6,
         'rug': 1,
         'bedroom': 1,
         'sofa': 1})

That's not a very good return.  It also occurs to me that I might not be thinking clearly about all the kinds of furniture, buildings, and other structures that might make up the Georgian landscape.  Fortunately, linguists have compiled many dictionaries that can help us to navigate the semantic universe with greater precision.  One of these dictionaries is "WordNet," the fruit of a long-term research undertaking at Princeton. 

# Expanded Controlled Vocabulary with Wordnet

The 'get_synsets' command in TextBlob unlocks the full potential of WordNet as a thesaurus/dictionary.  We won't go into the full power of the "synsets," but suffice it to say that WordNet knows that a "house" when used as a noun can mean a "firm," a "sign of the zodiac," a "family," or a "theater."

In [130]:
from textblob import Word

from textblob.wordnet import NOUN

w1 = Word("house")
w1.synsets
syns = w1.get_synsets(pos=NOUN)
print(syns)


[Synset('house.n.01'), Synset('firm.n.01'), Synset('house.n.03'), Synset('house.n.04'), Synset('house.n.05'), Synset('house.n.06'), Synset('house.n.07'), Synset('sign_of_the_zodiac.n.01'), Synset('house.n.09'), Synset('family.n.01'), Synset('theater.n.01'), Synset('house.n.12')]


Likewise, wordnet knows that the word "building" can refer to different kinds of construction (as a noun), but it can also be a verb form used with many different senses.

In [131]:
wn.synsets('building')

[Synset('building.n.01'),
 Synset('construction.n.01'),
 Synset('construction.n.07'),
 Synset('building.n.04'),
 Synset('construct.v.01'),
 Synset('build_up.v.02'),
 Synset('build.v.03'),
 Synset('build.v.04'),
 Synset('build.v.05'),
 Synset('build.v.06'),
 Synset('build.v.07'),
 Synset('build.v.08'),
 Synset('build_up.v.04'),
 Synset('build.v.10')]

A *hyponym* is a word that is a more specific version of another word.  So if we want to know the many different types of houses in the dictionary, we can use wordnet's .hyponyms() command to navigate these lists, and we can generate another controlled vocabulary from them.

In [132]:
synlist = wn.synset('house.n.01').hyponyms()
synlist

[Synset('beach_house.n.01'),
 Synset('boarding_house.n.01'),
 Synset('bungalow.n.01'),
 Synset('cabin.n.02'),
 Synset('chalet.n.01'),
 Synset('chapterhouse.n.02'),
 Synset('country_house.n.01'),
 Synset('detached_house.n.01'),
 Synset('dollhouse.n.01'),
 Synset('duplex_house.n.01'),
 Synset('farmhouse.n.01'),
 Synset('gatehouse.n.01'),
 Synset('guesthouse.n.01'),
 Synset('hacienda.n.02'),
 Synset('lodge.n.04'),
 Synset('lodging_house.n.01'),
 Synset('maisonette.n.02'),
 Synset('mansion.n.02'),
 Synset('ranch_house.n.01'),
 Synset('residence.n.02'),
 Synset('row_house.n.01'),
 Synset('safe_house.n.01'),
 Synset('saltbox.n.01'),
 Synset('sod_house.n.01'),
 Synset('solar_house.n.01'),
 Synset('tract_house.n.01'),
 Synset('villa.n.02')]

Wordnet's 'lemmas()' function gives us access to the base lemma associated with any of these categories.  Let's use the 'append' function and the 'lemmas' function to create a vocabulary list stripped of the Wordnet apparatus.  

In [133]:
new_vocab = []

for syn in synlist:
    for lemma in syn.lemmas():
        new_vocab.append(str(lemma.name()))
        
print(new_vocab)

['beach_house', 'boarding_house', 'boardinghouse', 'bungalow', 'cottage', 'cabin', 'chalet', 'chapterhouse', 'fraternity_house', 'frat_house', 'country_house', 'detached_house', 'single_dwelling', 'dollhouse', "doll's_house", 'duplex_house', 'duplex', 'semidetached_house', 'farmhouse', 'gatehouse', 'guesthouse', 'hacienda', 'lodge', 'hunting_lodge', 'lodging_house', 'rooming_house', 'maisonette', 'maisonnette', 'mansion', 'mansion_house', 'manse', 'hall', 'residence', 'ranch_house', 'residence', 'row_house', 'town_house', 'safe_house', 'saltbox', 'sod_house', 'soddy', 'adobe_house', 'solar_house', 'tract_house', 'villa']


Bear in mind: we don't have to stop here.  We can keep drilling down within each of these categories to get an even finer-grained list.

In [134]:
for syn in synlist:
    print(syn.lemmas())

[Lemma('beach_house.n.01.beach_house')]
[Lemma('boarding_house.n.01.boarding_house'), Lemma('boarding_house.n.01.boardinghouse')]
[Lemma('bungalow.n.01.bungalow'), Lemma('bungalow.n.01.cottage')]
[Lemma('cabin.n.02.cabin')]
[Lemma('chalet.n.01.chalet')]
[Lemma('chapterhouse.n.02.chapterhouse'), Lemma('chapterhouse.n.02.fraternity_house'), Lemma('chapterhouse.n.02.frat_house')]
[Lemma('country_house.n.01.country_house')]
[Lemma('detached_house.n.01.detached_house'), Lemma('detached_house.n.01.single_dwelling')]
[Lemma('dollhouse.n.01.dollhouse'), Lemma('dollhouse.n.01.doll's_house')]
[Lemma('duplex_house.n.01.duplex_house'), Lemma('duplex_house.n.01.duplex'), Lemma('duplex_house.n.01.semidetached_house')]
[Lemma('farmhouse.n.01.farmhouse')]
[Lemma('gatehouse.n.01.gatehouse')]
[Lemma('guesthouse.n.01.guesthouse')]
[Lemma('hacienda.n.02.hacienda')]
[Lemma('lodge.n.04.lodge'), Lemma('lodge.n.04.hunting_lodge')]
[Lemma('lodging_house.n.01.lodging_house'), Lemma('lodging_house.n.01.rooming_h

In [135]:
finer_syns = []

for syn in synlist:
    hypo = syn.hyponyms()
    for h in hypo:
        finer_syns.append(h)
 #   print(syn.hyponyms())
  
print(finer_syns)

[Synset('bed_and_breakfast.n.01'), Synset('log_cabin.n.01'), Synset('chateau.n.01'), Synset('dacha.n.01'), Synset('shooting_lodge.n.01'), Synset('summer_house.n.01'), Synset('villa.n.03'), Synset('villa.n.04'), Synset('lodge.n.03'), Synset('flophouse.n.01'), Synset('manor.n.01'), Synset('palace.n.01'), Synset('stately_home.n.01'), Synset('court.n.09'), Synset('deanery.n.01'), Synset('manse.n.02'), Synset('palace.n.04'), Synset('parsonage.n.01'), Synset('religious_residence.n.01'), Synset('brownstone.n.02'), Synset('terraced_house.n.01')]


In [136]:
new_vocab_finer = []

for syn in finer_syns:
    for subsyn in syn.lemmas():
        w = subsyn.name()
        new_vocab_finer.append(str(w))
            


new_vocab_finer

['bed_and_breakfast',
 'bed-and-breakfast',
 'log_cabin',
 'chateau',
 'dacha',
 'shooting_lodge',
 'shooting_box',
 'summer_house',
 'villa',
 'villa',
 'lodge',
 'flophouse',
 'dosshouse',
 'manor',
 'manor_house',
 'palace',
 'castle',
 'stately_home',
 'court',
 'deanery',
 'manse',
 'palace',
 'parsonage',
 'vicarage',
 'rectory',
 'religious_residence',
 'cloister',
 'brownstone',
 'terraced_house']
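
The cells above go exactly one level deeper by hand. If you wanted to keep going until there's nothing left to find, the usual trick is a function that calls itself. Here is a minimal sketch that uses a small invented dictionary in place of WordNet, so that it runs without any downloads. The names toy_hyponyms and all_hyponyms are hypothetical, and the hierarchy is made up for illustration:

```python
# Toy hierarchy standing in for WordNet: each word maps to its hyponyms.
# (Invented for illustration; real entries would come from .hyponyms().)
toy_hyponyms = {
    "house": ["cabin", "mansion"],
    "cabin": ["log_cabin"],
    "mansion": ["manor", "palace"],
    "palace": ["summer_palace"],
}

def all_hyponyms(word):
    """Collect every word below `word`, however many levels down."""
    found = []
    for child in toy_hyponyms.get(word, []):
        found.append(child)
        found.extend(all_hyponyms(child))  # recurse into the child's own hyponyms
    return found

deep_vocab = all_hyponyms("house")
print(deep_vocab)
```

The same pattern should carry over to real synsets by replacing the dictionary lookup with a call to .hyponyms(), though the resulting list can get long quickly.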

In [137]:
controlled_vocab+=new_vocab_finer
controlled_vocab+=new_vocab
controlled_vocab


['garden',
 'room',
 'estate',
 'manor',
 'hedge',
 'residence',
 'park',
 'lane',
 'chair',
 'sofa',
 'settee',
 'bed',
 'bedroom',
 'chaise',
 'table',
 'rug',
 'carpet',
 'candelabra',
 'shed',
 'cottage',
 'fence',
 'turret',
 'castle',
 'palace',
 'hut',
 'dwelling',
 'bed_and_breakfast',
 'bed-and-breakfast',
 'log_cabin',
 'chateau',
 'dacha',
 'shooting_lodge',
 'shooting_box',
 'summer_house',
 'villa',
 'villa',
 'lodge',
 'flophouse',
 'dosshouse',
 'manor',
 'manor_house',
 'palace',
 'castle',
 'stately_home',
 'court',
 'deanery',
 'manse',
 'palace',
 'parsonage',
 'vicarage',
 'rectory',
 'religious_residence',
 'cloister',
 'brownstone',
 'terraced_house',
 'beach_house',
 'boarding_house',
 'boardinghouse',
 'bungalow',
 'cottage',
 'cabin',
 'chalet',
 'chapterhouse',
 'fraternity_house',
 'frat_house',
 'country_house',
 'detached_house',
 'single_dwelling',
 'dollhouse',
 "doll's_house",
 'duplex_house',
 'duplex',
 'semidetached_house',
 'farmhouse',
 'gatehouse',

That's great, but if we look for phrases like "country_house" in our list of Jane Austen words, we'll run into trouble, because we've already cleaned the text of Jane Austen so that words are separated by spaces, with no underscores.

Because of this, we need to produce a clean list from controlled_vocab.

In [138]:
cleaned_controlled_vocab = []
for w in controlled_vocab:
    cleaned = w.replace("_", " ").replace("-", " ") # replace underscores and hyphens with spaces
    cleaned_controlled_vocab.append(cleaned)

import numpy as np
cleaned_controlled_vocab = set(cleaned_controlled_vocab) # get only unique values
cleaned_controlled_vocab

{'adobe house',
 'beach house',
 'bed',
 'bed and breakfast',
 'bedroom',
 'boarding house',
 'boardinghouse',
 'brownstone',
 'bungalow',
 'cabin',
 'candelabra',
 'carpet',
 'castle',
 'chair',
 'chaise',
 'chalet',
 'chapterhouse',
 'chateau',
 'cloister',
 'cottage',
 'country house',
 'court',
 'dacha',
 'deanery',
 'detached house',
 "doll's house",
 'dollhouse',
 'dosshouse',
 'duplex',
 'duplex house',
 'dwelling',
 'estate',
 'farmhouse',
 'fence',
 'flophouse',
 'frat house',
 'fraternity house',
 'garden',
 'gatehouse',
 'guesthouse',
 'hacienda',
 'hall',
 'hedge',
 'hunting lodge',
 'hut',
 'lane',
 'lodge',
 'lodging house',
 'log cabin',
 'maisonette',
 'maisonnette',
 'manor',
 'manor house',
 'manse',
 'mansion',
 'mansion house',
 'palace',
 'park',
 'parsonage',
 'ranch house',
 'rectory',
 'religious residence',
 'residence',
 'room',
 'rooming house',
 'row house',
 'rug',
 'safe house',
 'saltbox',
 'semidetached house',
 'settee',
 'shed',
 'shooting box',
 'shoo

Next, let's run a "for" loop similar to those we've seen before to search for the words in cleaned_controlled_vocab that also appear in Jane Austen.  As you'll recall, the variable "words" holds the list of words in the first novel, Sense and Sensibility (we created it from data[0] above).

In [148]:
controlled_words = []

for w in words:
    for v in cleaned_controlled_vocab: 
        if w in v:
            controlled_words.append(v)

Counter(controlled_words)

Counter({'chapterhouse': 2358,
         'residence': 2011,
         'dwelling': 4310,
         'cabin': 5933,
         'chair': 4040,
         'maisonnette': 6093,
         'maisonette': 6410,
         'shooting box': 3886,
         'hacienda': 4049,
         'cloister': 2691,
         'log cabin': 5935,
         'hunting lodge': 3888,
         'shooting lodge': 3887,
         'mansion': 5140,
         'rooming house': 4242,
         'single dwelling': 4320,
         'boarding house': 6195,
         'semidetached house': 5348,
         'religious residence': 2102,
         'vicarage': 4018,
         'boardinghouse': 6195,
         'fraternity house': 6802,
         'chaise': 4742,
         'lodging house': 4149,
         'mansion house': 5400,
         'villa': 4020,
         'sofa': 6259,
         'estate': 2929,
         'candelabra': 6015,
         'bed and breakfast': 8356,
         'frat house': 3143,
         'stately home': 3401,
         'gatehouse': 3148,
         'chateau': 2

Hmm, I'm not sure that's right.  There aren't so many haciendas or log cabins in Jane Austen.  What might have gone wrong?  Perhaps it's a pattern-matching problem again: the computer is looking for "chapter" and finds "chapterhouse."  How can we tweak the code above to be more accurate?  

The answer involves the difference between "if a in b" and "if a==b".  

The former searches for cases where a is "in" b.  The latter searches for cases where a is an exact match for b.

Notice that we can get even more control over matches -- by looking for word boundaries before and after a and b -- by using "regular expression" or "regex" language with the .compile() and .match() commands above.  In future iterations of exercises, you may want to learn more about this in order to gain total control over pattern matching with language; you can find an introduction to the "re" package for regex here: https://docs.python.org/3/howto/regex.html

In [149]:
controlled_words2 = []

for w in words:
    for v in cleaned_controlled_vocab: 
        if w == v: # notice what I changed in this line
            controlled_words2.append(v)

Counter(controlled_words2)

Counter({'estate': 19,
         'residence': 7,
         'park': 51,
         'dwelling': 6,
         'room': 97,
         'cottage': 56,
         'garden': 11,
         'shed': 3,
         'court': 4,
         'table': 23,
         'manor': 1,
         'mansion': 1,
         'chair': 9,
         'bed': 25,
         'lane': 3,
         'chaise': 6,
         'rug': 1,
         'bedroom': 1,
         'rectory': 2,
         'parsonage': 9,
         'sofa': 1,
         'hall': 1})

In this case, I'm only counting words that *exactly match* my controlled vocab.  
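
It's worth noticing that "in" behaves differently depending on what sits on its right-hand side. With a string ("if w in v"), it looks for substrings; with a list or a set of strings, it looks for an exact member. So the nested loop with "==" can also be written as a single membership test. A sketch, using invented stand-ins for words and cleaned_controlled_vocab:

```python
from collections import Counter

# Invented stand-ins for the notebook's `words` and `cleaned_controlled_vocab`.
sample_words = "the cottage had a garden and the garden had a shed".split()
sample_vocab = {"cottage", "garden", "shed", "country house"}

# Nested loops comparing each word to each vocabulary entry with == ...
nested = Counter(v for w in sample_words for v in sample_vocab if w == v)

# ... give the same counts as one membership test against the set.
direct = Counter(w for w in sample_words if w in sample_vocab)

print(direct)
```

The membership version is usually faster on long texts, since Python doesn't have to walk through the whole vocabulary for every word.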

Again, I could also use regular expressions and the commands .compile() and .match() (see bigrams section above) to tailor exactly what is being looked for, depending on whether I'm looking for exact matches of one string within another or exact matches of one string and another string.

In [158]:
controlled_words3 = []

for w in words:
    # build the pattern once per word; re.escape() makes any punctuation in w literal
    pattern = re.compile(r'\b%s\b' % re.escape(w), re.I) # notice what I changed in this line
    for v in cleaned_controlled_vocab: 
        if pattern.match(v): # notice what I changed in this line
            controlled_words3.append(v)

Counter(controlled_words3)



Counter({'estate': 19,
         'residence': 7,
         'park': 51,
         'single dwelling': 8,
         'dwelling': 6,
         'room': 97,
         'cottage': 56,
         'garden': 11,
         'shed': 3,
         'country house': 32,
         'court': 4,
         'table': 23,
         'summer house': 3,
         'manor': 1,
         'manor house': 1,
         'safe house': 9,
         'mansion': 1,
         'mansion house': 1,
         'chair': 9,
         'town house': 82,
         'bed and breakfast': 25,
         'bed': 25,
         'lane': 3,
         'chaise': 6,
         'rug': 1,
         'bedroom': 1,
         'lodging house': 2,
         'rectory': 2,
         'parsonage': 9,
         'tract house': 1,
         'sofa': 1,
         'hall': 1})

controlled_words3 counts a controlled_vocab entry whenever a *complete word* found in Jane Austen matches the beginning of that entry, because .match() only checks the start of a string.  Thus "town house" is counted every time Austen writes "town," and "bed and breakfast" gets the same count as "bed."  This output is clearly inaccurate: "safe house" doesn't sound like a circa 1820 term that Austen would have included in her novels; she simply used the word "safe."

All I'm doing here is demonstrating the various different outputs you can get from similar searches, using "in", "==", and regex pattern matching.  You could keep tweaking the code, but so far, the best results are from our second example, controlled_words2, where we used the exact match operator, "==", to match single words in Jane Austen against exact entries in the controlled_vocab.

There's only one problem with the solution thus far.  The variable *words* contains strings that are one word long, but we're matching it with cleaned_controlled_vocab -- which includes such two-word phrases as "terraced house." 

Those two-word phrases aren't getting accurately matched. We could fix this by searching the variable "bigrams_as_text" for the cleaned_controlled_vocab.  In fact, this is exactly what you'll be doing in one part of the exercise that follows.

# Exercise 

*To be turned in on Canvas*

1) First, let's have you get some more terminology of houses or furniture to add to cleaned_controlled_vocab.  Earlier in this notebook, where we pulled 'synsets' -- sets of related words -- from WordNet, we saw that a single word such as 'house' or 'building' belongs to multiple synsets.  We only took the hyponyms from one synset -- house.n.01.  But you could take the hyponyms from another synset.  Alternatively, you could look for synsets that contain the word 'furniture' or 'garden.'  What you choose is up to you, but try playing with WordNet a little.  Once you have a synset that you like, run it through the code above so that you have a new 'cleaned_controlled_vocab' that you can use to search Jane Austen.

2) Next, find the bigrams (two-word phrases) in Jane Austen that contain any of these lemmas or phrases in 'cleaned_controlled_vocab'.  Sort the phrases by descending frequency, and paste the top twenty into Canvas.

3) Write an interpretive paragraph of at least five sentences making some observations about the built landscape of England at the time of Jane Austen.  Offset phrases and words found in the text with quotation marks.

### Help where help is needed

If you're finding yourself confused about the code and how to follow directions at this point, bear in mind that we're moving very quickly through the introduction to Python. You might need to slow down and revisit some of the "optional" notebooks that we mentioned in Weeks 1-2.  Here they are again:

- lists : https://github.com/laurenfklein/emory-qtm340/blob/master/notebooks/lists.ipynb
- for loops : https://problemsolvingwithpython.com/09-Loops/09.01-For-Loops/
- expressions and strings :  https://github.com/laurenfklein/emory-qtm340/blob/master/notebooks/expressions-and-strings.ipynb
- dictionaries, sets, tuples: https://github.com/laurenfklein/emory-qtm340/blob/master/notebooks/dictionaries-sets-tuples.ipynb
- counting things: https://github.com/laurenfklein/emory-qtm340/blob/master/notebooks/counting.ipynb

Remember that SMU expects you to be spending around 6 hours every week on your homework for this class.  Don't be afraid to keep tweaking the code until it works -- or reaching out on Slack if you need encouragement from others.  

Also, please bear in mind that everyone who learns how to code ultimately does so through a lot of trial and error.  Try typing in code and running it. When you run into trouble, you can google your problems and find Stack Overflow results or blog entries that match your problem and suggest solutions.  The more you try, the faster you will master code.  

Don't give up!  Keep trying things until you feel like you're getting it! 

