For citation information, please see the "Source Information" section listed in the associated README file: https://github.com/stephbuon/digital-history/tree/master/hist3368-week3-controlled-vocab

# Week 3 Assignment: For Loops, Expert Vocabulary, and Revising With a Thesaurus

In this week's assignment, you'll learn how to loop over lists of data.  You'll also start the process of thinking critically about which words matter to you for the purposes of text mining, and how to use a thesaurus and the powers of reason to expand your expert vocabulary and divide it into categories of information. 

## Iterating over lists with for

[based on Lauren Klein's Lists and Loops https://github.com/laurenfklein/emory-qtm340/tree/0c3d0935ecd0a7920e331a8efd78240c49997606/notebooks]

The list comprehension syntax discussed earlier is very powerful: it allows you to succinctly transform one list into another list by thinking in terms of filtering and modification. But sometimes your primary goal isn't to make a new list, but simply to perform a set of operations on an existing list.

Let's say that you want to print every string in a list. Here's a short text:

In [22]:
text = "it was the best of times, it was the worst of times"

We can make a list of all the words in the text by splitting on whitespace:

In [23]:
words = text.split()

Of course, we can see what's in the list simply by evaluating the variable:

In [24]:
words

['it',
 'was',
 'the',
 'best',
 'of',
 'times,',
 'it',
 'was',
 'the',
 'worst',
 'of',
 'times']

But let's say that we want to print out each word on a separate line, without any of Python's weird punctuation. In other words, I want the output to look like:


    it
    was
    the
    best
    of
    times,
    it
    was
    the
    worst
    of
    times

But how can this be accomplished? We know that the print() function can display an individual string in this manner:

In [25]:
print("hello")

hello


So what we need, clearly, is a way to call the print() function with every item of the list. We could do this by writing a series of print() statements, one for every item in the list:

In [26]:
print(words[0])
print(words[1])
print(words[2])
print(words[3])
print(words[4])
print(words[5])
print(words[6])
print(words[7])
print(words[8])
print(words[9])
print(words[10])
print(words[11])

it
was
the
best
of
times,
it
was
the
worst
of
times



Nice, but there are some problems with this approach:

- It's kind of verbose---we're doing exactly the same thing multiple times, only with slightly different expressions. Surely there's an easier way to tell the computer to do this?
- It doesn't scale. What if we wrote a program that needs to produce hundreds or thousands of lines? Would we really need to write a print statement for each of those expressions?
- It requires us to know how many items are going to end up in the list to begin with.

Things are looking grim! But there's hope. Performing the same operation on all items of a list is an extremely common task in computer programming. So common that Python has some built-in syntax to make the task easy: the for loop.

Here's how a for loop looks:

for tempvar in sourcelist:
    statements

The words for and in just have to be there---that's how Python knows it's a for loop. Here's what each of those parts means.

    tempvar: A name for a variable. Inside of the for loop, this variable will contain the current item of the list.
    sourcelist: This can be any Python expression that evaluates to a list---a variable that contains a list, or a list slice, or even a list literal that you just type right in!
    statements: One or more Python statements. Everything tabbed over underneath the for will be executed once for each item in the list. The statements tabbed over underneath the for line are called the body of the loop.

Here's what the for loop for printing out every item in a list might look like:

In [27]:
for item in words:
    print(item)

it
was
the
best
of
times,
it
was
the
worst
of
times


The variable name item is arbitrary. You can pick whatever variable name you like, as long as you're consistent about using the same variable name in the body of the loop. If you wrote out this loop in a long-hand fashion, it might look like this:


    item = words[0]
    print(item)
    item = words[1]
    print(item)
    item = words[2]
    print(item)
    item = words[3]
    print(item)
    # etc.


    
    it
    was
    the
    best
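As mentioned in the description of sourcelist, the thing you loop over doesn't have to be a plain variable. Here is a minimal sketch (the example strings are just for illustration) that loops over a list slice and then over a list literal typed directly into the for line:

```python
text = "it was the best of times, it was the worst of times"
words = text.split()

# The source can be a slice: here, only the first three words.
for item in words[:3]:
    print(item)

# The source can also be a list literal typed directly after "in".
for item in ["hydrogen", "helium"]:
    print(item.upper())
```

Either way, the body runs once per item, exactly as it does when you loop over a named list.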
    
Of course, the body of the loop can have more than one statement, and you can assign values to variables inside the loop:


In [28]:
for item in words:
    yelling = item.upper()
    print(yelling)

IT
WAS
THE
BEST
OF
TIMES,
IT
WAS
THE
WORST
OF
TIMES


You can also include other kinds of nested statements inside the for loop, like if/else:

In [29]:

for item in words:
    if len(item) == 2:
        print(item.upper())
    elif len(item) == 3:
        print("   " + item)
    else:
        print(item)

IT
   was
   the
best
OF
times,
IT
   was
   the
worst
OF
times


This structure is called a "loop" because when Python reaches the end of the statements in the body, it "loops" back to the beginning of the body, and executes the same statements again (this time with the next item in the list).


Python programmers tend to use for loops most often when the problem would otherwise be too tricky or complicated to solve using a list comprehension. It's easy to paraphrase any list comprehension in for loop syntax. For example, this list comprehension, which evaluates to a list of the squares of even integers from 1 to 25:


In [30]:
[x * x for x in range(1, 26) if x % 2 == 0]


[4, 16, 36, 64, 100, 144, 196, 256, 324, 400, 484, 576]

You can rewrite this list comprehension as a for loop by starting out with an empty list, then appending an item to the list inside the loop. The source list remains the same:


In [31]:
result = []
for x in range(1, 26):
    if x % 2 == 0:
        result.append(x * x)
result

[4, 16, 36, 64, 100, 144, 196, 256, 324, 400, 484, 576]
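The same paraphrase works for strings. As a small sketch (the variable names shouted and shouted_loop are made up for this example), here is a list comprehension over words and the equivalent for loop, with a check that both give the same list:

```python
words = "it was the best of times".split()

# List comprehension: uppercase every word longer than two letters.
shouted = [w.upper() for w in words if len(w) > 2]

# The same logic written out as a for loop.
shouted_loop = []
for w in words:
    if len(w) > 2:
        shouted_loop.append(w.upper())

print(shouted)                  # ['WAS', 'THE', 'BEST', 'TIMES']
print(shouted == shouted_loop)  # True
```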

## Join: Making strings from lists

Once we've created a list of words, it's a common task to want to take that list and "glue" it back together, so it's a single string again, instead of a list. So, for example:

In [32]:
element_list = ["hydrogen", "helium", "lithium", "beryllium", "boron"]
glue = ", and "
glue.join(element_list)

'hydrogen, and helium, and lithium, and beryllium, and boron'

The .join() method needs a "glue" string to the left of it---this is the string that will be placed in between the list elements. In the parentheses to the right, you need to put an expression that evaluates to a list. Very frequently with .join(), programmers don't bother to assign the "glue" string to a variable first, so you end up with code that looks like this:


In [33]:
words = ["this", "is", "a", "test"]
" ".join(words)

'this is a test'
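The glue can be any string, including a newline, which gives another way to print one word per line. One thing to keep in mind is that .join() only accepts strings, so a list of numbers has to be converted first. A minimal sketch (the variable names are just for illustration):

```python
words = ["this", "is", "a", "test"]

# A newline as glue puts each word on its own line.
one_per_line = "\n".join(words)
print(one_per_line)

# .join() needs strings, so convert numbers with str() before joining.
numbers = [1, 2, 3]
joined_numbers = ", ".join(str(n) for n in numbers)
print(joined_numbers)  # 1, 2, 3
```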


When we're working with .split() and .join(), our workflow usually looks something like this:

    Split a string to get a list of units (usually words).
    Use some of the list operations discussed above to modify or slice the list.
    Join that list back together into a string.
    Do something with that string (e.g., print it out).
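The four steps above can be sketched like this, using a slice as the "modify" step (the variable names units, first_half, and rejoined are made up for the example):

```python
text = "it was the best of times, it was the worst of times"

units = text.split()             # 1. split the string into a list of words
first_half = units[:6]           # 2. slice the list (keep the first six words)
rejoined = " ".join(first_half)  # 3. join the list back into a string
print(rejoined)                  # 4. do something with the string
```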

With this in mind, here's a program that splits a string into words, randomizes the order of the words, then prints out the results:


In [34]:
import random

text = "it was a dark and stormy night"
words = text.split()   # split() returns a list of words
random.shuffle(words)  # shuffle() reorders that list in place
' '.join(words)        # the order will differ each time you run this cell

In [35]:
# alternative: sort the words alphabetically instead of shuffling them

text = "it was a dark and stormy night"
words = text.split()
words.sort()
for word in words:
    print(word)

a
and
dark
it
night
stormy
was


EXERCISE: Write a Python command-line program that prints out the lines of a text file in random order.

## Nested For Loops

Sometimes, I want to use multiple for loops to do my business.  This usually happens when data is 'structured,' that is, when the data exists in multiple separate lists, dictionaries, or dataframes, to each of which we want to apply separate conditions.

First, let's make a set of lists.  Each list contains a set of strings that correspond to the names of novels written by three novelists.

In [36]:
dickens = ['oliver twist', 'bleak house', 'a tale of two cities']
austen = ['sense and sensibility', 'emma', 'pride and prejudice']
trollope = ['doctor thorne', 'barchester towers', 'the land leaguers']

In [37]:
austen[1]

'emma'

Now, let's make a list of the novelists' names.

In [38]:
novelists = ['dickens', 'austen', 'trollope']

Importantly, I can now call up novels by the strings in the variable 'novelists.'  Here are two ways of getting Dickens' novels:

In [39]:
globals()['dickens'] # this looks for variables called 'dickens'

['oliver twist', 'bleak house', 'a tale of two cities']

In [40]:
globals()[novelists[0]] # this looks for variables that correspond to the first item in the list, 'novelists'

['oliver twist', 'bleak house', 'a tale of two cities']

Both give the same thing.

Let's put that into a for loop:

In [41]:
for novelist in novelists: # cycle through each novelist
    their_novels = globals()[novelist] # for each novelist, pull up the list that corresponds to their name -- thus for 'dickens,' call up the variable called 'dickens'
    print(novelist)
    print(their_novels)

dickens
['oliver twist', 'bleak house', 'a tale of two cities']
austen
['sense and sensibility', 'emma', 'pride and prejudice']
trollope
['doctor thorne', 'barchester towers', 'the land leaguers']


That's nicely formatted output. 
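The globals() lookup works because each list lives in a variable whose name matches the novelist. A common alternative, sketched here as an assumption about how you might organize the same data rather than as part of the original recipe, is to store the lists in a dictionary keyed by novelist (the name novels_by_author is made up for this example):

```python
# Each key is a novelist; each value is that novelist's list of novels.
novels_by_author = {
    'dickens': ['oliver twist', 'bleak house', 'a tale of two cities'],
    'austen': ['sense and sensibility', 'emma', 'pride and prejudice'],
    'trollope': ['doctor thorne', 'barchester towers', 'the land leaguers'],
}

# .items() hands us each key together with its value.
for novelist, their_novels in novels_by_author.items():
    print(novelist)
    print(their_novels)
```

With a dictionary, the link between a name and its list is stored in the data itself, so it does not depend on what the variables happen to be called.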

But say we want to do something with each of the novel names -- like creating a new dataset where each novel name is accurately annotated with the name of its author.  How do I glue them together, when what I want to glue changes for each novel but also for each novelist? 

To do that, we'll need a *double* for loop, or a "nested" for loop.  

The outside for loop cycles through each novelist and calls up their list of novels in the variable 'their_novels'.

The inner for loop cycles through each of the items in 'their_novels.'

I can use these nested for loops to output a really nicely formatted list of authors and novels.

In [42]:

for novelist in novelists: # for each novelist,
    their_novels = globals()[novelist] 
    for novel in their_novels: # cycle through each of the novels for that novelist. for each novel of each novelist:
        print(novel)
        print(novelist)
        print("")
    print("------------------")


oliver twist
dickens

bleak house
dickens

a tale of two cities
dickens

------------------
sense and sensibility
austen

emma
austen

pride and prejudice
austen

------------------
doctor thorne
trollope

barchester towers
trollope

the land leaguers
trollope

------------------


That's a nicely formatted print-out.  

But what I really want is a dataset where every entry is the name of a novelist and the novel they wrote.  Can I tweak the double for loop to do that?

In [43]:
novels_with_authors = []

for novelist in novelists: # for each novelist,
    their_novels = globals()[novelist] 
    for novel in their_novels: # cycle through each of the novels for that novelist. for each novel of each novelist:
        new_entry = novelist + "-" + novel # create a dummy variable, 'new_entry', which lists the novelist and novel
        novels_with_authors.append(new_entry) # add the dummy variable to my master list, novels_with_authors
        
novels_with_authors

['dickens-oliver twist',
 'dickens-bleak house',
 'dickens-a tale of two cities',
 'austen-sense and sensibility',
 'austen-emma',
 'austen-pride and prejudice',
 'trollope-doctor thorne',
 'trollope-barchester towers',
 'trollope-the land leaguers']

Notice that I produced this output -- and the output above -- with a "double for loop" or a "nested for loop."

The first "for loop" iterates through the novelists, one at a time:


&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;"for novelist in novelists:"


The second "for loop" takes each novelist, and iterates through the list of their novels:
       
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; "for novel in their_novels"


Because the loops are nested, I'm not randomly applying Trollope or Austen's names to random novel titles; I'm creating a list where each author's name corresponds to the right novel.

In [44]:
novels_with_authors

['dickens-oliver twist',
 'dickens-bleak house',
 'dickens-a tale of two cities',
 'austen-sense and sensibility',
 'austen-emma',
 'austen-pride and prejudice',
 'trollope-doctor thorne',
 'trollope-barchester towers',
 'trollope-the land leaguers']
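Gluing the author and title together with "-" is fine for display, but a title that itself contains a hyphen would be hard to split apart again later. As a hedged sketch of one alternative, the same nested loop can store each pair as a tuple instead of a single string (novels_by_author and pairs are names made up for this example):

```python
novels_by_author = {
    'dickens': ['oliver twist', 'bleak house', 'a tale of two cities'],
    'austen': ['sense and sensibility', 'emma', 'pride and prejudice'],
    'trollope': ['doctor thorne', 'barchester towers', 'the land leaguers'],
}

pairs = []
for novelist, their_novels in novels_by_author.items():  # outer loop: each novelist
    for novel in their_novels:                           # inner loop: each of their novels
        pairs.append((novelist, novel))                  # keep author and title as separate fields

print(len(pairs))  # 9
print(pairs[4])    # ('austen', 'emma')
```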

# Working with a Controlled Vocabulary 

Inspired by tutorials by Paige McKenzie - https://p-mckenzie.github.io/2018/01/11/Jane-Austen/
William Scott - https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089

For the rest of this notebook, we'll be working with a 'controlled vocabulary,' which is to say, expert-defined words that help to limit our pursuit of wordcount to words that share a certain semantic valence.  Controlled Vocabularies have been used in digital history to examine the history of words used by Victorian people to describe the way that strangers walked down the street, and to show that novelists in the nineteenth century described the urban landscape with increasing detail.  

First, we'll download some novels by Jane Austen to try our vocabulary on.  Then, we'll talk about how to clean the text using stemming and lemmatization.  

Next, we'll use a controlled vocabulary to limit the count to words that are interesting to us.  Then, we'll expand that controlled vocabulary using the 'hyponym' feature of the WordNet package, which consults with dictionaries of the English language organized by linguists at Princeton.  

Finally, we'll visualize our findings.


## Download some Jane Austen Novels

In [45]:
import nltk, numpy, re, matplotlib# , num2words

In [46]:
cd /scratch/group/history/hist_3368-jguldi

/scratch/group/history/hist_3368-jguldi


In [47]:
# load the novels from text files and trim the front and back matter

with open('senseandsensibility.txt', 'r') as myfile:
    sas_data = myfile.read().split('\n\n"I suppose you know, ma\'am, that Mr. Ferrars is married"\n\nIt _was_ Edward\n\n"Everything in such respectable condition"\n\n ')[1].split('THE END')[0].strip()

with open('emma.txt', 'r') as myfile:
    emma_data = myfile.read().split('CHAPTER I')[1].split('FINIS')[0].strip()

with open('prideandprejudice.txt', 'r') as myfile:
    pap_data = myfile.read().split('CHAPTER I')[1].split('End of the Project Gutenberg EBook of Pride and Prejudice, by Jane Austen')[0].strip()
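One thing to watch for with this pattern: .split(marker)[1] returns only the text between the first and second appearances of the marker, and the string 'CHAPTER I' also appears at the start of 'CHAPTER II'. That may be why the word counts for Emma and Pride and Prejudice later in this notebook are so much smaller than those for Sense and Sensibility. Here is a minimal sketch of a helper that splits only once on each marker; the function name text_between and the sample string are hypothetical, not part of the original notebook:

```python
def text_between(raw, start_marker, end_marker):
    """Return the text after the first start_marker and before the next end_marker."""
    body = raw.split(start_marker, 1)[1]          # maxsplit=1: split on the first match only
    return body.split(end_marker, 1)[0].strip()

# A toy stand-in for a Project Gutenberg file.
sample = "front matter CHAPTER I first chapter CHAPTER II second chapter THE END licence"

only_first = sample.split("CHAPTER I")[1].strip()    # stops at 'CHAPTER II'
whole = text_between(sample, "CHAPTER I", "THE END")  # keeps every chapter

print(only_first)  # first chapter
print(whole)       # first chapter CHAPTER II second chapter
```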


Make sure that your data matches what you think it should.

In [48]:
# printing only first 2000 characters.
sas_data[:2000]

"*       *       *       *       *\n\n\n\n\nCHAPTER I\n\n\nThe family of Dashwood had long been settled in Sussex. Their estate\nwas large, and their residence was at Norland Park, in the centre of\ntheir property, where, for many generations, they had lived in so\nrespectable a manner as to engage the general good opinion of their\nsurrounding acquaintance. The late owner of this estate was a single\nman, who lived to a very advanced age, and who for many years of his\nlife, had a constant companion and housekeeper in his sister. But her\ndeath, which happened ten years before his own, produced a great\nalteration in his home; for to supply her loss, he invited and\nreceived into his house the family of his nephew Mr. Henry Dashwood,\nthe legal inheritor of the Norland estate, and the person to whom he\nintended to bequeath it. In the society of his nephew and niece, and\ntheir children, the old Gentleman's days were comfortably spent. His\nattachment to them all increased. The consta

Looks good!

Isn't it getting tiring, retyping the same command for each novel? Let's throw them all into one data set so we can loop through them.


In [49]:

data = [sas_data, emma_data, pap_data]
data[0][:2000]

"*       *       *       *       *\n\n\n\n\nCHAPTER I\n\n\nThe family of Dashwood had long been settled in Sussex. Their estate\nwas large, and their residence was at Norland Park, in the centre of\ntheir property, where, for many generations, they had lived in so\nrespectable a manner as to engage the general good opinion of their\nsurrounding acquaintance. The late owner of this estate was a single\nman, who lived to a very advanced age, and who for many years of his\nlife, had a constant companion and housekeeper in his sister. But her\ndeath, which happened ten years before his own, produced a great\nalteration in his home; for to supply her loss, he invited and\nreceived into his house the family of his nephew Mr. Henry Dashwood,\nthe legal inheritor of the Norland estate, and the person to whom he\nintended to bequeath it. In the society of his nephew and niece, and\ntheir children, the old Gentleman's days were comfortably spent. His\nattachment to them all increased. The consta

The text still contains newline characters ("\n") wherever the original file had a line break.  We'll get rid of those in a second.

In [50]:
for i in range(len(data)):
    data[i] = data[i].replace('\n', ' ')
data[0][:2000]

"*       *       *       *       *     CHAPTER I   The family of Dashwood had long been settled in Sussex. Their estate was large, and their residence was at Norland Park, in the centre of their property, where, for many generations, they had lived in so respectable a manner as to engage the general good opinion of their surrounding acquaintance. The late owner of this estate was a single man, who lived to a very advanced age, and who for many years of his life, had a constant companion and housekeeper in his sister. But her death, which happened ten years before his own, produced a great alteration in his home; for to supply her loss, he invited and received into his house the family of his nephew Mr. Henry Dashwood, the legal inheritor of the Norland estate, and the person to whom he intended to bequeath it. In the society of his nephew and niece, and their children, the old Gentleman's days were comfortably spent. His attachment to them all increased. The constant attention of Mr. a

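Replacing each "\n" with a space leaves runs of several spaces behind, as the output above shows. That is harmless for word counting, because .split() treats any run of whitespace as one separator. If you want a tidy string as well, here is a small sketch of two ways to collapse the extra whitespace (the sample string is a shortened stand-in for the novel text):

```python
import re

sample = "*       *\n\n\nCHAPTER I\n\n\nThe family of Dashwood\nhad long been settled"

# Option 1: replace every run of whitespace with a single space.
collapsed = re.sub(r'\s+', ' ', sample).strip()

# Option 2: split on whitespace, then join with single spaces.
rejoined = ' '.join(sample.split())

print(collapsed)
print(collapsed == rejoined)  # True
```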
Next, let's split the text into words and print the first 20 words of each novel.

In [51]:
import pandas

for novel in data:
    words = novel.split()
    print(words[:20]) 


['*', '*', '*', '*', '*', 'CHAPTER', 'I', 'The', 'family', 'of', 'Dashwood', 'had', 'long', 'been', 'settled', 'in', 'Sussex.', 'Their', 'estate', 'was']
['Emma', 'Woodhouse,', 'handsome,', 'clever,', 'and', 'rich,', 'with', 'a', 'comfortable', 'home', 'and', 'happy', 'disposition,', 'seemed', 'to', 'unite', 'some', 'of', 'the', 'best']
['.', 'It', 'is', 'a', 'truth', 'universally', 'acknowledged,', 'that', 'a', 'single', 'man', 'in', 'possession', 'of', 'a', 'good', 'fortune,', 'must', 'be', 'in']


## Cleaning the Novels

Now, let's lowercase the text and get rid of punctuation.

In [52]:
import re

for i in range(len(data)):
    # data[i] is the current novel
    data[i] = data[i].lower() # force to lowercase
    data[i] = re.sub('[\",.;:?([)\]_*]', '', data[i]) # remove punctuation and special characters with regular expression
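To see what that regular expression does and does not remove, it can help to try it on one short string first. The sketch below writes the same character class as a raw string (an equivalent spelling, assumed here for readability) and applies it to a made-up sample sentence:

```python
import re

# Same set of characters as above: " , . ; : ? ( [ ) ] _ *
pattern = r'[",.;:?(\[)\]_*]'

sample = 'It _was_ Edward; "Indeed!" (she said).'
cleaned = re.sub(pattern, '', sample.lower())

print(cleaned)  # it was edward indeed! she said
```

Notice that the exclamation mark survives, because "!" is not in the character class. Apostrophes, hyphens, and curly quotation marks are also left alone, so add them to the pattern if you want them removed.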


### Stemming

Stemming is the process of removing suffixes, like "ed" or "ing".

We will use another standard NLTK package, PorterStemmer, to do the stemming.



In [53]:
from nltk.stem import PorterStemmer

st = PorterStemmer()

stemmed_list = []

for novel in data:
    words = novel.split()
    for word in words:
        stemmed = st.stem(word)
        stemmed_list.append(stemmed)
        
stemmed_list[:20] # i have changed this so you print just the first words
# printing all the words is actually way more computer intensive than it may seem

['chapter',
 'i',
 'the',
 'famili',
 'of',
 'dashwood',
 'had',
 'long',
 'been',
 'settl',
 'in',
 'sussex',
 'their',
 'estat',
 'wa',
 'larg',
 'and',
 'their',
 'resid',
 'wa']

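As the output shows, "family" becomes "famili," "settled" becomes "settl" (which means it will be counted with "settle" and "settling"), and "estate" becomes "estat."  Elsewhere in the novels, "universally" becomes "univers" (which means it will be counted with "universal" and "universe") and "single" becomes "singl" (which means it will be counted with "singled").  Notice that a stem does not have to be a real word: "was" has become "wa."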

### Lemmatization

Let's pick up another term -- lemmatization -- which is extremely memory intensive, but far more accurate.  

In [54]:
from nltk.corpus import wordnet as wn

wn.morphy('aardwolves')

'aardwolf'

In [55]:
lemma_list = []

for novel in data:
    words = novel.split()
    for word in words:
        lemma = wn.morphy(word)
        if not lemma:
            # morphy returns None when Wordnet has no entry for the word (this includes many common function words), so skip it
            continue
        lemma_list.append(lemma)

lemma_list[:20]

['chapter',
 'i',
 'family',
 'have',
 'long',
 'be',
 'settle',
 'in',
 'sussex',
 'estate',
 'wa',
 'large',
 'residence',
 'wa',
 'at',
 'park',
 'in',
 'centre',
 'property',
 'many']

Lemmatization is often a more useful approach than stemming because it leverages an understanding of the word itself to convert the word back to its root word. "Acknowledged"  becomes "acknowledge," and "daughters" becomes "daughter."  

Note an important oddity -- wn.morphy() returns "None" for words that Wordnet does not contain, such as "that."  The loop above skips those words so that they do not add noise when we count lemmas or graph them.  

Stemming and lemmatization are important because they matter for how we count.
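Here is a minimal sketch of why that matters. The dictionary toy_lemmas is a hand-written stand-in for a real lemmatizer (it is not NLTK output), and the token list is made up, but the effect on the counts is the same: different forms of a word are added together.

```python
from collections import Counter

tokens = ["daughter", "daughters", "settled", "settle", "daughter"]

# Counting the raw tokens keeps every form separate.
raw_counts = Counter(tokens)

# A toy lookup table standing in for a lemmatizer.
toy_lemmas = {"daughters": "daughter", "settled": "settle"}

# Replace each token with its lemma (or keep it unchanged) before counting.
merged_counts = Counter(toy_lemmas.get(t, t) for t in tokens)

print(raw_counts["daughter"])     # 2
print(merged_counts["daughter"])  # 3
```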

In [56]:
from collections import Counter

count = Counter(stemmed_list)
print(count.most_common(100))

[('to', 4214), ('the', 4191), ('of', 3692), ('and', 3543), ('her', 2598), ('a', 2161), ('i', 2014), ('in', 1992), ('wa', 1896), ('it', 1896), ('she', 1629), ('be', 1501), ('that', 1403), ('for', 1282), ('not', 1281), ('as', 1247), ('you', 1239), ('he', 1125), ('hi', 1048), ('had', 1032), ('with', 1010), ('have', 920), ('but', 862), ('at', 848), ('is', 780), ('by', 761), ('mr', 758), ('on', 703), ('all', 674), ('so', 661), ('him', 649), ('my', 638), ('elinor', 614), ('which', 600), ('could', 588), ('no', 570), ('from', 554), ('would', 527), ('veri', 525), ('they', 524), ('their', 506), ('mariann', 486), ('them', 484), ('been', 454), ('were', 451), ('what', 443), ('thi', 442), ('me', 429), ('more', 414), ('ani', 409), ('your', 407), ('said', 393), ('everi', 388), ('will', 385), ('such', 373), ('than', 372), ('do', 368), ('or', 360), ('an', 347), ('one', 333), ('when', 317), ('must', 305), ('if', 303), ('much', 301), ('onli', 299), ('own', 284), ('know', 282), ('who', 276), ('time', 264),

# Counting Words and N-Grams

Let's see what the word counts look like now.

In [57]:
from collections import Counter

for novel in data:
    words = novel.split()
    count = Counter(words)
    print(count.most_common(10))

[('the', 4092), ('to', 4090), ('of', 3573), ('and', 3419), ('her', 2522), ('a', 2048), ('i', 1948), ('in', 1937), ('was', 1848), ('it', 1701)]
[('and', 107), ('to', 102), ('a', 92), ('of', 90), ('the', 81), ('her', 61), ('i', 49), ('you', 48), ('it', 45), ('in', 43)]
[('you', 31), ('of', 29), ('to', 22), ('a', 21), ('the', 18), ('and', 17), ('i', 17), ('that', 15), ('it', 14), ('is', 14)]


### N-Grams

Sometimes we want to look for multi-word phrases instead of individual words.  For example, if we're researching the living spaces of Jane Austen's England, we definitely want to know whether she refers to "dining rooms" or "bed-rooms" (which our punctuation clean-up might have turned into separate words, depending on what we did).
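An n-gram is simply a run of n consecutive words. Before reaching for a library, it may help to see a bare-bones sketch that builds two-word and three-word phrases with nothing but lists and zip(); the sample sentence is only for illustration:

```python
words = "it was the best of times".split()

# Pair each word with the word that follows it.
bigrams = list(zip(words, words[1:]))

# Group each word with the two words that follow it.
trigrams = list(zip(words, words[1:], words[2:]))

print(bigrams[:2])    # [('it', 'was'), ('was', 'the')]
print(len(trigrams))  # 4
```

A six-word sentence yields five bigrams and four trigrams, because each phrase starts one word later than the previous one.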

In [58]:
from nltk.stem import WordNetLemmatizer
from textblob import TextBlob

lemmatizer = WordNetLemmatizer()
three_grams_list = []

for novel in data:
    # Get the first 20 words of the novel.
    words = novel.split(maxsplit=20)
    
    # Delete the last entry of the list as it contains the rest of the novel's text.
    del words[-1]
    
    # Lemmatize
    lemmatized_words = []
    for word in words:
        lemmatized_words.append(lemmatizer.lemmatize(word))
    
    # Join the lemmatized words back into text.
    text = ' '.join(lemmatized_words)
    
    # Collect the n-grams and extend it to our list of n grams
    three_grams = TextBlob(text).ngrams(n=3)
    three_grams_list.extend(three_grams)

three_grams_list[:20]

[WordList(['chapter', 'i', 'the']),
 WordList(['i', 'the', 'family']),
 WordList(['the', 'family', 'of']),
 WordList(['family', 'of', 'dashwood']),
 WordList(['of', 'dashwood', 'had']),
 WordList(['dashwood', 'had', 'long']),
 WordList(['had', 'long', 'been']),
 WordList(['long', 'been', 'settled']),
 WordList(['been', 'settled', 'in']),
 WordList(['settled', 'in', 'sussex']),
 WordList(['in', 'sussex', 'their']),
 WordList(['sussex', 'their', 'estate']),
 WordList(['their', 'estate', 'wa']),
 WordList(['estate', 'wa', 'large']),
 WordList(['wa', 'large', 'and']),
 WordList(['large', 'and', 'their']),
 WordList(['and', 'their', 'residence']),
 WordList(['their', 'residence', 'wa']),
 WordList(['emma', 'woodhouse', 'handsome']),
 WordList(['woodhouse', 'handsome', 'clever'])]

In [59]:
for novel in data:
    bigrams = TextBlob(novel).ngrams(n=2)
    bigram_counter = Counter()
    
    for bigram in bigrams:
        # Join the bigram into a string as it is a WordList object.
        bigram_text = ' '.join(bigram)
        # Update the count.
        bigram_counter.update([bigram_text])

    print(bigram_counter.most_common(10))

[('to be', 436), ('of the', 431), ('in the', 359), ('it was', 281), ('of her', 277), ('to the', 242), ('mrs jennings', 237), ('to her', 231), ('i am', 224), ('she was', 209)]
[('” “', 29), ('miss taylor', 23), ('mr knightley', 13), ('of her', 12), ('mr weston', 12), ('of the', 10), ('to have', 9), ('it was', 9), ('her father', 9), ('she had', 9)]
[('my dear', 8), ('that he', 6), ('mr bennet', 6), ('you must', 5), ('of them', 5), ('it is', 4), ('do not', 4), ('how can', 4), ('will be', 4), ('of the', 3)]


# Controlled Vocabulary

Let's look for what scholars call a "controlled vocabulary" -- a list of words that we know to be meaningful. For right now, let's pretend that we're researching the buildings, landscape, and furniture of nineteenth-century England.  I'm curious about what kinds of spaces are described in Austen, and I'd like to begin by counting them.
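As a warm-up, here is the idea in miniature: keep only the words that appear in our list, then count them. The vocabulary and sentence below are toy examples, and storing the vocabulary as a set is just one reasonable choice (membership checks on a set stay fast even when the vocabulary grows).

```python
from collections import Counter

toy_vocab = {"garden", "room", "cottage"}
sentence = "the cottage had a small garden and every room faced the garden"

# Count only the words that belong to the controlled vocabulary.
hits = Counter(w for w in sentence.split() if w in toy_vocab)

print(hits["garden"])  # 2
```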

In [60]:
controlled_vocab = [
    "garden",
    "room", 
    "estate",
    "manor", 
    "hedge", 
    "residence",
    "park",
    "lane",
    "chair",
    "sofa",
    "settee",
    "bed",
    "bedroom",
    "chaise",
    "table",
    "rug",
    "carpet",
    "candelabra",
    "shed",
    "cottage",
    "fence",
    "turret",
    "castle",
    "palace",
    "hut",
    "dwelling"
]

In [61]:
controlled_words = []


words = data[0].split()

for w in words:
    if w in controlled_vocab:
        controlled_words.append(w)

Counter(controlled_words)

Counter({'estate': 19,
         'residence': 7,
         'park': 51,
         'dwelling': 6,
         'room': 97,
         'cottage': 56,
         'garden': 11,
         'shed': 3,
         'table': 23,
         'manor': 1,
         'chair': 9,
         'bed': 25,
         'lane': 3,
         'chaise': 6,
         'rug': 1,
         'bedroom': 1,
         'sofa': 1})

That's not a very good return.  It also occurs to me that I might not be thinking clearly about all the kinds of furniture, buildings, and other structures that might make up the Georgian landscape.  Fortunately, linguists have compiled many dictionaries that can help us to navigate the semantic universe with greater precision.  One of these dictionaries is "Wordnet," the fruit of a long-term research undertaking at Princeton. 

# Expanded Controlled Vocabulary with Wordnet

The 'get_synsets' command, which the TextBlob package provides as a way into Wordnet, unlocks the thesaurus/dictionary in its full potential.  We won't go into the full power of the "synsets," but suffice it to say that Wordnet knows that a "house" when used as a noun can mean a "firm," a "sign of the zodiac," a "family," or a "theater."

In [62]:
from textblob import Word

from textblob.wordnet import NOUN

w1 = Word("house")
w1.synsets
syns = w1.get_synsets(pos=NOUN)
print(syns)


[Synset('house.n.01'), Synset('firm.n.01'), Synset('house.n.03'), Synset('house.n.04'), Synset('house.n.05'), Synset('house.n.06'), Synset('house.n.07'), Synset('sign_of_the_zodiac.n.01'), Synset('house.n.09'), Synset('family.n.01'), Synset('theater.n.01'), Synset('house.n.12')]


Likewise, wordnet knows that the word "building" can refer to different kinds of construction (as a noun), but it can also be a verb form used with many different senses.

In [63]:
wn.synsets('building')

[Synset('building.n.01'),
 Synset('construction.n.01'),
 Synset('construction.n.07'),
 Synset('building.n.04'),
 Synset('construct.v.01'),
 Synset('build_up.v.02'),
 Synset('build.v.03'),
 Synset('build.v.04'),
 Synset('build.v.05'),
 Synset('build.v.06'),
 Synset('build.v.07'),
 Synset('build.v.08'),
 Synset('build_up.v.04'),
 Synset('build.v.10')]

A *hyponym* is a word that is a more specific version of another word.  So if we want to know the many different types of houses in the dictionary, we can use wordnet's .hyponyms() command to navigate these lists, and we can generate another controlled vocabulary from them.

In [64]:
synlist = wn.synset('house.n.01').hyponyms()
synlist

[Synset('beach_house.n.01'),
 Synset('boarding_house.n.01'),
 Synset('bungalow.n.01'),
 Synset('cabin.n.02'),
 Synset('chalet.n.01'),
 Synset('chapterhouse.n.02'),
 Synset('country_house.n.01'),
 Synset('detached_house.n.01'),
 Synset('dollhouse.n.01'),
 Synset('duplex_house.n.01'),
 Synset('farmhouse.n.01'),
 Synset('gatehouse.n.01'),
 Synset('guesthouse.n.01'),
 Synset('hacienda.n.02'),
 Synset('lodge.n.04'),
 Synset('lodging_house.n.01'),
 Synset('maisonette.n.02'),
 Synset('mansion.n.02'),
 Synset('ranch_house.n.01'),
 Synset('residence.n.02'),
 Synset('row_house.n.01'),
 Synset('safe_house.n.01'),
 Synset('saltbox.n.01'),
 Synset('sod_house.n.01'),
 Synset('solar_house.n.01'),
 Synset('tract_house.n.01'),
 Synset('villa.n.02')]

Wordnet's 'lemmas()' function gives us access to the base lemma associated with any of these categories.  Let's use the 'append' function and the 'lemmas' function to create a vocabulary list stripped of the Wordnet apparatus.  

In [65]:
new_vocab = []

for syn in synlist:
    for lemma in syn.lemmas():
        new_vocab.append(str(lemma.name()))
        
print(new_vocab)

['beach_house', 'boarding_house', 'boardinghouse', 'bungalow', 'cottage', 'cabin', 'chalet', 'chapterhouse', 'fraternity_house', 'frat_house', 'country_house', 'detached_house', 'single_dwelling', 'dollhouse', "doll's_house", 'duplex_house', 'duplex', 'semidetached_house', 'farmhouse', 'gatehouse', 'guesthouse', 'hacienda', 'lodge', 'hunting_lodge', 'lodging_house', 'rooming_house', 'maisonette', 'maisonnette', 'mansion', 'mansion_house', 'manse', 'hall', 'residence', 'ranch_house', 'residence', 'row_house', 'town_house', 'safe_house', 'saltbox', 'sod_house', 'soddy', 'adobe_house', 'solar_house', 'tract_house', 'villa']


Bear in mind: we don't have to stop here.  We can keep drilling down within each of these categories to get an even finer-grained list.

In [66]:
for syn in synlist:
    print(syn.lemmas())

[Lemma('beach_house.n.01.beach_house')]
[Lemma('boarding_house.n.01.boarding_house'), Lemma('boarding_house.n.01.boardinghouse')]
[Lemma('bungalow.n.01.bungalow'), Lemma('bungalow.n.01.cottage')]
[Lemma('cabin.n.02.cabin')]
[Lemma('chalet.n.01.chalet')]
[Lemma('chapterhouse.n.02.chapterhouse'), Lemma('chapterhouse.n.02.fraternity_house'), Lemma('chapterhouse.n.02.frat_house')]
[Lemma('country_house.n.01.country_house')]
[Lemma('detached_house.n.01.detached_house'), Lemma('detached_house.n.01.single_dwelling')]
[Lemma('dollhouse.n.01.dollhouse'), Lemma('dollhouse.n.01.doll's_house')]
[Lemma('duplex_house.n.01.duplex_house'), Lemma('duplex_house.n.01.duplex'), Lemma('duplex_house.n.01.semidetached_house')]
[Lemma('farmhouse.n.01.farmhouse')]
[Lemma('gatehouse.n.01.gatehouse')]
[Lemma('guesthouse.n.01.guesthouse')]
[Lemma('hacienda.n.02.hacienda')]
[Lemma('lodge.n.04.lodge'), Lemma('lodge.n.04.hunting_lodge')]
[Lemma('lodging_house.n.01.lodging_house'), Lemma('lodging_house.n.01.rooming_h

In [67]:
finer_syns = []

for syn in synlist:
    hypo = syn.hyponyms()
    for h in hypo:
        finer_syns.append(h)
    # print(syn.hyponyms())  # uncomment to inspect each synset's hyponyms one at a time

print(finer_syns)

[Synset('bed_and_breakfast.n.01'), Synset('log_cabin.n.01'), Synset('chateau.n.01'), Synset('dacha.n.01'), Synset('shooting_lodge.n.01'), Synset('summer_house.n.01'), Synset('villa.n.03'), Synset('villa.n.04'), Synset('lodge.n.03'), Synset('flophouse.n.01'), Synset('manor.n.01'), Synset('palace.n.01'), Synset('stately_home.n.01'), Synset('court.n.09'), Synset('deanery.n.01'), Synset('manse.n.02'), Synset('palace.n.04'), Synset('parsonage.n.01'), Synset('religious_residence.n.01'), Synset('brownstone.n.02'), Synset('terraced_house.n.01')]


In [68]:
new_vocab_finer = []

for syn in finer_syns:
    for subsyn in syn.lemmas():
        new_vocab_finer.append(str(subsyn.name()))

new_vocab_finer

['bed_and_breakfast',
 'bed-and-breakfast',
 'log_cabin',
 'chateau',
 'dacha',
 'shooting_lodge',
 'shooting_box',
 'summer_house',
 'villa',
 'villa',
 'lodge',
 'flophouse',
 'dosshouse',
 'manor',
 'manor_house',
 'palace',
 'castle',
 'stately_home',
 'court',
 'deanery',
 'manse',
 'palace',
 'parsonage',
 'vicarage',
 'rectory',
 'religious_residence',
 'cloister',
 'brownstone',
 'terraced_house']
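
The cells above drill down exactly one level. If we wanted to keep going until there are no more specific terms left, we could repeat the same step inside a loop. The sketch below shows the idea with a small hand-made dictionary, `toy_hyponyms`, standing in for WordNet. Both the dictionary and its contents are invented for illustration, so the code runs without NLTK. With real synsets you would call `syn.hyponyms()` where this sketch looks a term up in the dictionary.

```python
# Hypothetical stand-in for WordNet: each term maps to its more specific terms
toy_hyponyms = {
    'house': ['lodge', 'mansion'],
    'lodge': ['shooting_lodge'],
    'mansion': ['manor', 'palace'],
}

current_level = ['house']   # the terms we are about to drill into
all_terms = []              # everything found so far, level by level

# Keep going as long as the current level still has terms in it
while current_level:
    next_level = []
    for term in current_level:
        # .get() returns an empty list for terms with no hyponyms
        for child in toy_hyponyms.get(term, []):
            next_level.append(child)
    for child in next_level:
        all_terms.append(child)
    current_level = next_level

print(all_terms)
```

The loop stops on its own once a level produces no new terms, so we do not need to know in advance how deep the hierarchy goes.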

In [69]:
# Note: append() adds each list as a single nested item, as the output below shows
controlled_vocab.append(new_vocab_finer)
controlled_vocab.append(new_vocab)
controlled_vocab

['garden',
 'room',
 'estate',
 'manor',
 'hedge',
 'residence',
 'park',
 'lane',
 'chair',
 'sofa',
 'settee',
 'bed',
 'bedroom',
 'chaise',
 'table',
 'rug',
 'carpet',
 'candelabra',
 'shed',
 'cottage',
 'fence',
 'turret',
 'castle',
 'palace',
 'hut',
 'dwelling',
 ['bed_and_breakfast',
  'bed-and-breakfast',
  'log_cabin',
  'chateau',
  'dacha',
  'shooting_lodge',
  'shooting_box',
  'summer_house',
  'villa',
  'villa',
  'lodge',
  'flophouse',
  'dosshouse',
  'manor',
  'manor_house',
  'palace',
  'castle',
  'stately_home',
  'court',
  'deanery',
  'manse',
  'palace',
  'parsonage',
  'vicarage',
  'rectory',
  'religious_residence',
  'cloister',
  'brownstone',
  'terraced_house'],
 ['beach_house',
  'boarding_house',
  'boardinghouse',
  'bungalow',
  'cottage',
  'cabin',
  'chalet',
  'chapterhouse',
  'fraternity_house',
  'frat_house',
  'country_house',
  'detached_house',
  'single_dwelling',
  'dollhouse',
  "doll's_house",
  'duplex_house',
  'duplex',
  '
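
Look closely at the output above: the two new lists sit inside `controlled_vocab` as whole lists, square brackets and all, rather than as individual words. That is what `append()` does when you hand it a list. A test such as `'chateau' in controlled_vocab` then comes back `False`, because the only items at the top level are the original words plus two nested lists. If the goal is one flat list of words, `extend()` is the usual alternative. Here is a small comparison using made-up lists (`base`, `extra`, `nested`, and `flat` are names chosen just for this example):

```python
base = ['garden', 'room']
extra = ['chateau', 'dacha']

# append() adds the whole list as ONE new item
nested = list(base)
nested.append(extra)
print(nested)
print('chateau' in nested)

# extend() adds each word of the list as its own item
flat = list(base)
flat.extend(extra)
print(flat)
print('chateau' in flat)
```

The first membership test prints `False` and the second prints `True`, even though both lists were built from the same words.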

In [70]:
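controlled_words = []

# Keep each word from the text whose lowercase form is an item in controlled_vocab
for w in words:
    if w.lower() in controlled_vocab:
        controlled_words.append(w)

# Tally how often each matching word occurs
Counter(controlled_words)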

Counter({'estate': 19,
         'residence': 7,
         'park': 51,
         'dwelling': 6,
         'room': 97,
         'cottage': 56,
         'garden': 11,
         'shed': 3,
         'table': 23,
         'manor': 1,
         'chair': 9,
         'bed': 25,
         'lane': 3,
         'chaise': 6,
         'rug': 1,
         'bedroom': 1,
         'sofa': 1})
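
A Counter prints its entries in whatever order it happens to hold them, which makes it hard to see at a glance which words dominate. The `most_common()` method of `Counter` (from Python's `collections` module) returns the entries sorted from most to least frequent. The sketch below uses a hand-typed subset of the counts shown above; `sample_counts` is a name invented for this example.

```python
from collections import Counter

# A few of the counts from the output above, typed in by hand for illustration
sample_counts = Counter({'estate': 19, 'park': 51, 'room': 97,
                         'cottage': 56, 'table': 23})

# most_common(3) returns the three highest counts as (word, count) pairs
for word, count in sample_counts.most_common(3):
    print(word, count)
```

Calling `most_common()` with no number returns every entry, still sorted by descending count.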

# Exercise 

*To be turned in on Canvas*

1) Expand the variable "controlled_words" by looping through the words in the original "controlled_vocab" variable, finding their noun hyponyms, and creating a list of lemmas that you can use to search Jane Austen.

2) Next, find the bigrams (two-word phrases) in Jane Austen that contain any of these words.  Sort the phrases by descending frequency, and paste the top twenty in Canvas.

3) Write an interpretive paragraph of at least five sentences making some observations about the built landscape of England at the time of Jane Austen.  Offset phrases and words found in the text with quotation marks.
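
As a hint for step 2, here is a minimal sketch of one way to collect bigrams with a for loop. It is not the only approach. It uses the short Dickens sentence from the start of this notebook rather than Jane Austen, and the names `tokens`, `targets`, `bigrams`, and `bigram_counts` are invented for this illustration. The `targets` set stands in for your own controlled vocabulary.

```python
from collections import Counter

text = "it was the best of times, it was the worst of times"
tokens = text.split()

# Stand-in for a controlled vocabulary; replace with your own list of words
targets = {'was', 'best'}

bigrams = []

# zip() pairs each word with the word that follows it
for first, second in zip(tokens, tokens[1:]):
    if first in targets or second in targets:
        bigrams.append(first + ' ' + second)

bigram_counts = Counter(bigrams)

# Print the phrases from most to least frequent
for phrase, count in bigram_counts.most_common():
    print(phrase, count)
```

Because `tokens[1:]` is one word shorter than `tokens`, `zip()` stops at the last complete pair, so the final word is never paired with nothing.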