# Week 3 Assignment: For Loops, Expert Vocabulary, and Revising With a Thesaurus

In this week's assignment, you'll learn how to loop over lists of data.  You'll also start the process of thinking critically about which words matter to you for the purposes of text mining, and how to use a thesaurus and the powers of reason to expand your expert vocabulary and divide it into categories of information. 

## Iterating over lists with for

[based on Lauren Klein's Lists and Loops https://github.com/laurenfklein/emory-qtm340/tree/0c3d0935ecd0a7920e331a8efd78240c49997606/notebooks]

The list comprehension syntax discussed earlier is very powerful: it allows you to succinctly transform one list into another list by thinking in terms of filtering and modification. But sometimes your primary goal isn't to make a new list, but simply to perform a set of operations on an existing list.

Let's say that you want to print every string in a list. Here's a short text:

In [95]:
text = "it was the best of times, it was the worst of times"

We can make a list of all the words in the text by splitting on whitespace:

In [96]:
words = text.split()

Of course, we can see what's in the list simply by evaluating the variable:

In [97]:
words

['it',
 'was',
 'the',
 'best',
 'of',
 'times,',
 'it',
 'was',
 'the',
 'worst',
 'of',
 'times']

But let's say that we want to print out each word on a separate line, without any of Python's weird punctuation. In other words, I want the output to look like:


    it
    was
    the
    best
    of
    times,
    it
    was
    the
    worst
    of
    times

But how can this be accomplished? We know that the print() function can display an individual string in this manner:

In [98]:
print("hello")

hello


So what we need, clearly, is a way to call the print() function with every item of the list. We could do this by writing a series of print() statements, one for every item in the list:

In [99]:
print(words[0])
print(words[1])
print(words[2])
print(words[3])
print(words[4])
print(words[5])
print(words[6])
print(words[7])
print(words[8])
print(words[9])
print(words[10])
print(words[11])

it
was
the
best
of
times,
it
was
the
worst
of
times



Nice, but there are some problems with this approach:

- It's kind of verbose---we're doing exactly the same thing multiple times, only with slightly different expressions. Surely there's an easier way to tell the computer to do this?
- It doesn't scale. What if we wrote a program that needs to produce hundreds or thousands of lines? Would we really need to write a print statement for each of those expressions?
- It requires us to know how many items are going to end up in the list to begin with.

Things are looking grim! But there's hope. Performing the same operation on all items of a list is an extremely common task in computer programming. It's so common that Python has some built-in syntax to make the task easy: the for loop.

Here's how a for loop looks:

for tempvar in sourcelist:
    statements

The words for and in just have to be there; that's how Python knows it's a for loop. Here's what each of those parts means.

- tempvar: A name for a variable. Inside of the for loop, this variable will contain the current item of the list.
- sourcelist: This can be any Python expression that evaluates to a list: a variable that contains a list, a list slice, or even a list literal that you type right in!
- statements: One or more Python statements. Everything indented underneath the for line will be executed once for each item in the list. These indented statements are called the body of the loop.

Here's what the for loop for printing out every item in a list might look like:

In [100]:
for item in words:
    print(item)

it
was
the
best
of
times,
it
was
the
worst
of
times


The variable name item is arbitrary. You can pick whatever variable name you like, as long as you're consistent about using the same variable name in the body of the loop. If you wrote out this loop in a long-hand fashion, it might look like this:


    item = words[0]
    print(item)
    item = words[1]
    print(item)
    item = words[2]
    print(item)
    item = words[3]
    print(item)
    # etc.


    
    it
    was
    the
    best
    
Of course, the body of the loop can have more than one statement, and you can assign values to variables inside the loop:


In [101]:
for item in words:
    yelling = item.upper()
    print(yelling)

IT
WAS
THE
BEST
OF
TIMES,
IT
WAS
THE
WORST
OF
TIMES


You can also include other kinds of nested statements inside the for loop, like if/else:

In [102]:

for item in words:
    if len(item) == 2:
        print(item.upper())
    elif len(item) == 3:
        print("   " + item)
    else:
        print(item)

IT
   was
   the
best
OF
times,
IT
   was
   the
worst
OF
times


This structure is called a "loop" because when Python reaches the end of the statements in the body, it "loops" back to the beginning of the body, and executes the same statements again (this time with the next item in the list).
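Because the body runs once per item, a variable created before the loop can carry a result from one pass to the next. The sketch below reuses the sentence from earlier in this section; the names `total_letters` and `longest` are illustrative and not part of the original notebook.

```python
text = "it was the best of times, it was the worst of times"
words = text.split()

total_letters = 0   # running total, updated on every pass through the loop
longest = ""        # longest word seen so far

for item in words:
    total_letters = total_letters + len(item)
    if len(item) > len(longest):
        longest = item

print(total_letters)
print(longest)
```

Note that punctuation attached to a word counts toward its length here, which is one reason to clean the text before measuring it.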


Python programmers tend to use for loops most often when the problem would otherwise be too tricky or complicated to solve using a list comprehension. It's easy to paraphrase any list comprehension in for loop syntax. For example, this list comprehension, which evaluates to a list of the squares of even integers from 1 to 25:


In [103]:
[x * x for x in range(1, 26) if x % 2 == 0]


[4, 16, 36, 64, 100, 144, 196, 256, 324, 400, 484, 576]

You can rewrite this list comprehension as a for loop by starting out with an empty list, then appending an item to the list inside the loop. The source list remains the same:


In [104]:
result = []
for x in range(1, 26):
    if x % 2 == 0:
        result.append(x * x)
result

[4, 16, 36, 64, 100, 144, 196, 256, 324, 400, 484, 576]
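The same translation works for lists of strings. As a small sketch (the variable names `shouted` and `shouted_loop` are made up for this example), here is a comprehension and its loop equivalent that keep only the words longer than three characters and uppercase them:

```python
text = "it was the best of times, it was the worst of times"
words = text.split()

# list comprehension: uppercase every word longer than three characters
shouted = [w.upper() for w in words if len(w) > 3]

# equivalent for loop: start with an empty list, append inside the loop
shouted_loop = []
for w in words:
    if len(w) > 3:
        shouted_loop.append(w.upper())

print(shouted)
print(shouted == shouted_loop)
```

Both versions build the same list; the loop version simply makes each step visible.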

## Join: Making strings from lists

Once we've created a list of words, it's a common task to want to take that list and "glue" it back together, so it's a single string again, instead of a list. So, for example:

In [105]:
element_list = ["hydrogen", "helium", "lithium", "beryllium", "boron"]
glue = ", and "
glue.join(element_list)

'hydrogen, and helium, and lithium, and beryllium, and boron'

The .join() method needs a "glue" string to the left of it---this is the string that will be placed in between the list elements. In the parentheses to the right, you need to put an expression that evaluates to a list. Very frequently with .join(), programmers don't bother to assign the "glue" string to a variable first, so you end up with code that looks like this:


In [106]:
words = ["this", "is", "a", "test"]
" ".join(words)

'this is a test'


When we're working with .split() and .join(), our workflow usually looks something like this:

1) Split a string to get a list of units (usually words).
2) Use some of the list operations discussed above to modify or slice the list.
3) Join that list back together into a string.
4) Do something with that string (e.g., print it out).
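The four steps can be sketched in a few lines. This is only one possible instance of the workflow: it slices off the first six words and reverses them, and the names `first_clause` and `result` are chosen for illustration.

```python
text = "it was the best of times, it was the worst of times"

words = text.split()              # 1) split the string into a list of words
first_clause = words[:6]          # 2) slice the list: keep the first six words
first_clause.reverse()            #    and modify it: reverse the order in place
result = " ".join(first_clause)   # 3) join the list back into one string
print(result)                     # 4) do something with the string
```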

With this in mind, here's a program that splits a string into words, randomizes the order of the words, then prints out the results:


In [107]:
import random   # shuffle() lives in the random module

text = "it was a dark and stormy night"
words = text.split()     # split() returns a list of words
random.shuffle(words)    # shuffle() reorders the list in place
' '.join(words)          # the order will differ each time you run the cell

In [108]:
# alternative: sort the words alphabetically instead of shuffling them

text = "it was a dark and stormy night"
words = text.split()
words.sort()
for word in words:
    print(word)

a
and
dark
it
night
stormy
was


EXERCISE: Write a Python command-line program that prints out the lines of a text file in random order.

# Working with a Controlled Vocabulary 

Inspired by tutorials by Paige McKenzie - https://p-mckenzie.github.io/2018/01/11/Jane-Austen/
William Scott - https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089

In [109]:
!pip install num2words

Defaulting to user installation because normal site-packages is not writeable


In [110]:
# num2words is not used in the cells below, so it is left out of this import
import nltk, numpy, re, matplotlib

In [None]:
cd /scratch/group/history/hist_3368-jguldi

In [None]:
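# load each novel from a local text file and trim the front and back matter
# note: 'CHAPTER I' is also the start of 'CHAPTER II', 'CHAPTER III', and so on, so
# .split('CHAPTER I')[1] keeps only the first chapter; .split('CHAPTER I', 1)[1] would keep the whole novel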

with open('senseandsensibility.txt', 'r') as myfile:
    sas_data = myfile.read().split('\n\n"I suppose you know, ma\'am, that Mr. Ferrars is married"\n\nIt _was_ Edward\n\n"Everything in such respectable condition"\n\n ')[1].split('THE END')[0].strip()

with open('emma.txt', 'r') as myfile:
    emma_data = myfile.read().split('CHAPTER I')[1].split('FINIS')[0].strip()

with open('prideandprejudice.txt', 'r') as myfile:
    pap_data = myfile.read().split('CHAPTER I')[1].split('End of the Project Gutenberg EBook of Pride and Prejudice, by Jane Austen')[0].strip()


Make sure that your data matches what you think it should.

In [None]:
print(sas_data)

Looks good!

Isn't it getting tiresome, retyping the same command for each novel? Let's put all three into one list so we can loop through them.


In [111]:

data = [sas_data, emma_data, pap_data]

data[0]



There still appear to be some errors where spaces have been replaced by "\n".  We'll get rid of those in a second.

Next, let's split each text into words and print the first twenty words of each.

In [139]:
for d in data:
    words = d.split()   # after the loop ends, `words` holds the word list of the last text
    print(words[:20])


['*', '*', '*', '*', '*', 'CHAPTER', 'I', 'The', 'family', 'of', 'Dashwood', 'had', 'long', 'been', 'settled', 'in', 'Sussex.', 'Their', 'estate', 'was']
['Emma', 'Woodhouse,', 'handsome,', 'clever,', 'and', 'rich,', 'with', 'a', 'comfortable', 'home', 'and', 'happy', 'disposition,', 'seemed', 'to', 'unite', 'some', 'of', 'the', 'best']
['.', 'It', 'is', 'a', 'truth', 'universally', 'acknowledged,', 'that', 'a', 'single', 'man', 'in', 'possession', 'of', 'a', 'good', 'fortune,', 'must', 'be', 'in']


... and count the individual words

In [114]:
from collections import Counter

# `words` still holds the word list of the last text split above
count = Counter(words)
print(count)

Counter({'of': 28, 'to': 22, 'you': 22, 'a': 20, 'the': 17, 'and': 17, 'I': 15, 'that': 14, 'is': 11, 'in': 11, 'be': 11, 'his': 11, 'he': 10, 'my': 10, 'Mr.': 9, 'for': 9, 'will': 9, 'as': 8, 'was': 8, 'are': 8, 'must': 7, 'no': 7, 'not': 7, 'so': 6, 'it': 6, 'or': 5, 'such': 5, 'one': 5, 'has': 5, 'she': 5, 'have': 5, 'dear,': 5, 'see': 5, 'visit': 5, 'man': 4, 'on': 4, 'know': 4, 'by': 4, 'with': 4, 'can': 4, 'You': 4, 'them': 4, 'do': 4, 'her': 4, 'good': 3, 'want': 3, 'little': 3, 'may': 3, '"My': 3, 'him': 3, 'Bennet': 3, 'replied': 3, 'had': 3, '"But': 3, 'four': 3, 'much': 3, 'how': 3, 'go,': 3, 'Bingley': 3, 'woman': 3, 'over': 3, 'than': 3, 'your': 3, 'twenty': 3, 'truth': 2, 'single': 2, 'possession': 2, 'some': 2, 'other': 2, 'daughters.': 2, 'dear': 2, 'Bennet,"': 2, 'heard': 2, 'Netherfield': 2, 'at': 2, 'Mrs.': 2, 'Long': 2, 'been': 2, 'all': 2, 'it."': 2, 'taken': 2, 'wife': 2, 'me,': 2, 'young': 2, 'large': 2, 'take': 2, '"Is': 2, 'five': 2, 'thousand': 2, 'thing': 2, 

Now, let's lowercase each word and strip the punctuation from its edges.

In [115]:
# loop over the words, lowercasing each one and stripping leading and trailing punctuation
for w in words:
    lower = w.lower().strip('",.;:?([)]_')
    print(lower)



it
is
a
truth
universally
acknowledged
that
a
single
man
in
possession
of
a
good
fortune
must
be
in
want
of
a
wife
however
little
known
the
feelings
or
views
of
such
a
man
may
be
on
his
first
entering
a
neighbourhood
this
truth
is
so
well
fixed
in
the
minds
of
the
surrounding
families
that
he
is
considered
as
the
rightful
property
of
some
one
or
other
of
their
daughters
my
dear
mr
bennet
said
his
lady
to
him
one
day
have
you
heard
that
netherfield
park
is
let
at
last
mr
bennet
replied
that
he
had
not
but
it
is
returned
she
for
mrs
long
has
just
been
here
and
she
told
me
all
about
it
mr
bennet
made
no
answer
do
not
you
want
to
know
who
has
taken
it
cried
his
wife
impatiently
you
want
to
tell
me
and
i
have
no
objection
to
hearing
it
this
was
invitation
enough
why
my
dear
you
must
know
mrs
long
says
that
netherfield
is
taken
by
a
young
man
of
large
fortune
from
the
north
of
england
that
he
came
down
on
monday
in
a
chaise
and
four
to
see
the
place
and
was
so
much
delighted
with
it
that
he
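Printing each cleaned word is useful for checking the result, but later steps usually need the cleaned words stored in a list. A minimal sketch, assuming a made-up sample sentence and an illustrative list name `clean_words`; it also skips tokens that become empty once their punctuation is stripped:

```python
sample = '. It is a truth universally acknowledged, that a single man must be in want of a wife.'
punctuation = '",.;:?([)]_'   # the same characters stripped in the cell above

clean_words = []
for w in sample.split():
    lower = w.lower().strip(punctuation)
    if lower != "":           # a token like "." becomes an empty string; skip it
        clean_words.append(lower)

print(clean_words)
```

Because `.strip()` only removes characters from the two ends of a string, punctuation inside a word (as in "bed-rooms") is left alone.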

# Stemming

Stemming is the process of removing suffixes, like "ed" or "ing".

We will use a standard NLTK class, PorterStemmer, to do the stemming.



In [116]:
from nltk.stem import PorterStemmer

st = PorterStemmer()

stemmed_list = []

# lowercase, strip punctuation, and stem each word in `words`
for w in words:
    w = w.lower().strip('",.;:?([)]_')
    stemmed = st.stem(w)
    stemmed_list.append(stemmed)

stemmed_list[:20]  # show only the first twenty stems; printing every word is slow

['',
 'it',
 'is',
 'a',
 'truth',
 'univers',
 'acknowledg',
 'that',
 'a',
 'singl',
 'man',
 'in',
 'possess',
 'of',
 'a',
 'good',
 'fortun',
 'must',
 'be',
 'in']

As we can see, "universally" becomes "univers" (which means it will be counted together with "universal" and "universe"), "single" becomes "singl" (which means it will be counted with "singled"), and "acknowledged" becomes "acknowledg".
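To see which surface forms end up sharing a stem, it can help to group words by their stem. The sketch below assumes NLTK is installed; the word list and the name `forms_by_stem` are invented for illustration.

```python
from collections import defaultdict
from nltk.stem import PorterStemmer

st = PorterStemmer()
sample_words = ["universally", "universal", "universe",
                "single", "singled", "acknowledged"]

# map each stem to the list of original words that collapse into it
forms_by_stem = defaultdict(list)
for w in sample_words:
    forms_by_stem[st.stem(w)].append(w)

for stem, forms in forms_by_stem.items():
    print(stem, forms)
```

Any words listed under the same stem will be added together when we count stems.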

# Lemmatization

Let's pick up another term -- lemmatization -- which is extremely memory intensive, but far more accurate.  

In [117]:
from nltk.corpus import wordnet as wn

wn.morphy('aardwolves')

'aardwolf'

In [118]:
lemma_list = []

for w in words:
    w = w.lower().strip('",.;:?([)]_')
    lemma = wn.morphy(w)   # returns None when WordNet has no entry for the word
    lemma_list.append(lemma)

lemma_list[:20]

[None,
 'it',
 'be',
 'a',
 'truth',
 'universally',
 'acknowledge',
 None,
 'a',
 'single',
 'man',
 'in',
 'possession',
 None,
 'a',
 'good',
 'fortune',
 'must',
 'be',
 'in']

Lemmatization is often a more useful approach than stemming because it leverages an understanding of the word itself to convert the word back to its root word. "Acknowledged"  becomes "acknowledge," and "daughters" becomes "daughter."  

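Note some important oddities: words that WordNet does not cover, such as "that" and "of," come back as None, and some words, such as "universally," are left unchanged.  If we count lemmas to graph them, we will want to eliminate the None values first.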

Stemming and lemmatization are important because they matter for how we count.
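As a small sketch of how the None values can be removed before counting, the example below copies the first twenty lemmas shown above into a list (`lemma_sample`, a name made up here) and counts only the real lemmas:

```python
from collections import Counter

# the first twenty lemmas shown above, including the None placeholders
lemma_sample = [None, 'it', 'be', 'a', 'truth', 'universally', 'acknowledge',
                None, 'a', 'single', 'man', 'in', 'possession', None, 'a',
                'good', 'fortune', 'must', 'be', 'in']

# keep only the entries that are actual lemmas
real_lemmas = [lemma for lemma in lemma_sample if lemma is not None]

lemma_counts = Counter(real_lemmas)
print(lemma_counts.most_common(3))
```

Notice that "is" has already been folded into "be," so the count for "be" includes both forms.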

In [119]:
# count the stems of the words in `words`, once each, and show the most common ones
count = Counter(st.stem(w.lower().strip('",.;:?([)]_')) for w in words)
print(count.most_common(100))

[('you', 31), ('of', 29), ('to', 22), ('a', 21), ('the', 18), ('and', 17), ('i', 17), ('it', 15), ('that', 15), ('is', 14), ('my', 13), ('in', 12), ('mr', 12), ('be', 11), ('hi', 11), ('he', 11), ('for', 11), ('them', 11), ('not', 9), ('will', 9), ('as', 8), ('dear', 8), ('but', 8), ('wa', 8), ('are', 8), ('must', 7), ('so', 7), ('have', 7), ('she', 7), ('no', 7), ('bennet', 6), ('visit', 6), ('her', 6), ('or', 5), ('such', 5), ('one', 5), ('ha', 5), ('me', 5), ('do', 5), ('know', 5), ('see', 5), ('how', 5), ('singl', 4), ('man', 4), ('wife', 4), ('may', 4), ('on', 4), ('daughter', 4), ('him', 4), ('by', 4), ('with', 4), ('what', 4), ('bingley', 4), ('marri', 4), ('year', 4), ('can', 4), ('think', 4), ('come', 4), ('go', 4), ('when', 4), ('they', 4), ('twenti', 4), ('good', 3), ('fortun', 3), ('want', 3), ('littl', 3), ('neighbourhood', 3), ('other', 3), ('repli', 3), ('had', 3), ('all', 3), ('four', 3), ('much', 3), ('thing', 3), ('like', 3)

# Counting N-Grams

Sometimes we want to look for multi-word phrases instead of individual words.  For example, if we're researching the living spaces of Jane Austen's England, we definitely want to know whether she refers to "dining rooms" or "bed-rooms" (which our punctuation clean-up might have turned into separate words, depending on what we did).
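The idea behind an n-gram can be sketched in plain Python before reaching for a library. The example below is not TextBlob's method; it pairs each word with the word after it using `zip`, on an invented sample sentence. Each bigram is a tuple, and because tuples are hashable, `Counter` can count them directly.

```python
from collections import Counter

sample = "the dining room was next to the drawing room and the dining room was cold"
tokens = sample.split()

# pair each token with the one that follows it
bigrams = list(zip(tokens, tokens[1:]))

bigram_counts = Counter(bigrams)
print(bigram_counts.most_common(3))
```

A list of n tokens always yields n - 1 bigrams, since the last token has no following neighbor.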

In [121]:
from textblob import TextBlob

# join the lemmas back into one string so that TextBlob can see neighboring words;
# calling ngrams() on a single word can never produce a three-word phrase
lemma_text = ' '.join(lemma for lemma in lemma_list if lemma is not None)

three_grams_list = TextBlob(lemma_text).ngrams(n=3)

# show only the first twenty: printing every n-gram can trigger
# Jupyter's "IOPub data rate exceeded" error
three_grams_list[:20]

In [122]:
for d in data:
    # WordList objects are unhashable, so convert each n-gram to a tuple before counting
    ngram_counts = Counter(tuple(ngram) for ngram in TextBlob(d).ngrams(n=2))
    print(ngram_counts.most_common(10))

# Controlled Vocabulary

Let's look for what scholars call a "controlled vocabulary" -- a list of words that we know to be meaningful. For right now, let's pretend that we're researching the buildings, landscape, and furniture of nineteenth-century England.  I'm curious about what kinds of spaces are described in Austen, and I'd like to begin by counting them.

In [123]:
controlled_vocab = [
    "garden",
    "room", 
    "estate",
    "manor", 
    "hedge", 
    "residence",
    "park",
    "lane",
    "chair",
    "sofa",
    "settee",
    "bed",
    "bedroom",
    "chaise",
    "table",
    "rug",
    "carpet",
    "candelabra",
    "shed",
    "cottage",
    "fence",
    "turret",
    "castle",
    "palace",
    "hut",
    "dwelling"
]

In [124]:
# `words` holds only the raw tokens of the last text split above, with punctuation
# still attached, so a token such as "garden," would not match "garden"
controlled_words = []

for w in words:
    if w.lower() in controlled_vocab:
        controlled_words.append(w)

Counter(controlled_words)

Counter({'Park': 1, 'chaise': 1})

That's not a very good return.  It also occurs to me that I might not be thinking clearly about all the kinds of furniture, buildings, and other structures that might make up the Georgian landscape.  Fortunately, linguists have compiled many dictionaries that can help us to navigate the semantic universe with greater precision.  One of these dictionaries is WordNet, the fruit of a long-term research undertaking at Princeton.
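Before expanding the vocabulary, it may be worth checking whether matches are being lost to capitalization or attached punctuation, or because only one text is being searched. The sketch below uses two invented sentences as stand-ins for the novels (`texts` and `vocab` are hypothetical names) and cleans each token before testing it against the vocabulary:

```python
from collections import Counter

# hypothetical stand-ins for the novels
texts = [
    "The garden was small, but the Park beyond it was grand.",
    "She left the room; the room was cold.",
]
vocab = {"garden", "room", "park"}   # a set makes membership tests fast

vocab_counts = Counter()
for d in texts:
    for w in d.split():
        cleaned = w.lower().strip('",.;:?([)]_')
        if cleaned in vocab:
            vocab_counts[cleaned] += 1

print(vocab_counts)
```

Counting the cleaned form rather than the raw token also means that "Park" and "park" are added together.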

# Expanded Controlled Vocabulary with Wordnet

TextBlob's 'get_synsets' method gives us access to WordNet and unlocks the thesaurus/dictionary in its full potential.  We won't go into the full power of the "synsets," but suffice it to say that WordNet knows that a "house" when used as a noun can mean a "firm," a "sign of the zodiac," a "family," or a "theater."

In [143]:
from textblob import Word

from textblob.wordnet import NOUN

w1 = Word("house")
syns = w1.get_synsets(pos=NOUN)   # noun senses only
print(syns)


[Synset('house.n.01'), Synset('firm.n.01'), Synset('house.n.03'), Synset('house.n.04'), Synset('house.n.05'), Synset('house.n.06'), Synset('house.n.07'), Synset('sign_of_the_zodiac.n.01'), Synset('house.n.09'), Synset('family.n.01'), Synset('theater.n.01'), Synset('house.n.12')]


Likewise, wordnet knows that the word "building" can refer to different kinds of construction (as a noun), but it can also be a verb form used with many different senses.

In [144]:
wn.synsets('building')

[Synset('building.n.01'),
 Synset('construction.n.01'),
 Synset('construction.n.07'),
 Synset('building.n.04'),
 Synset('construct.v.01'),
 Synset('build_up.v.02'),
 Synset('build.v.03'),
 Synset('build.v.04'),
 Synset('build.v.05'),
 Synset('build.v.06'),
 Synset('build.v.07'),
 Synset('build.v.08'),
 Synset('build_up.v.04'),
 Synset('build.v.10')]

A *hyponym* is a word that is a more specific version of another word.  So if we want to know the many different types of houses in the dictionary, we can use wordnet's .hyponyms() command to navigate these lists, and we can generate another controlled vocabulary from them.

In [145]:
synlist = wn.synset('house.n.01').hyponyms()
synlist

[Synset('beach_house.n.01'),
 Synset('boarding_house.n.01'),
 Synset('bungalow.n.01'),
 Synset('cabin.n.02'),
 Synset('chalet.n.01'),
 Synset('chapterhouse.n.02'),
 Synset('country_house.n.01'),
 Synset('detached_house.n.01'),
 Synset('dollhouse.n.01'),
 Synset('duplex_house.n.01'),
 Synset('farmhouse.n.01'),
 Synset('gatehouse.n.01'),
 Synset('guesthouse.n.01'),
 Synset('hacienda.n.02'),
 Synset('lodge.n.04'),
 Synset('lodging_house.n.01'),
 Synset('maisonette.n.02'),
 Synset('mansion.n.02'),
 Synset('ranch_house.n.01'),
 Synset('residence.n.02'),
 Synset('row_house.n.01'),
 Synset('safe_house.n.01'),
 Synset('saltbox.n.01'),
 Synset('sod_house.n.01'),
 Synset('solar_house.n.01'),
 Synset('tract_house.n.01'),
 Synset('villa.n.02')]

Wordnet's 'lemmas()' function gives us access to the base lemma associated with any of these categories.  Let's use the 'append' function and the 'lemmas' function to create a vocabulary list stripped of the Wordnet apparatus.  

In [146]:
new_vocab = []

for syn in synlist:
    for lemma in syn.lemmas():
        new_vocab.append(str(lemma.name()))
        
print(new_vocab)

['beach_house', 'boarding_house', 'boardinghouse', 'bungalow', 'cottage', 'cabin', 'chalet', 'chapterhouse', 'fraternity_house', 'frat_house', 'country_house', 'detached_house', 'single_dwelling', 'dollhouse', "doll's_house", 'duplex_house', 'duplex', 'semidetached_house', 'farmhouse', 'gatehouse', 'guesthouse', 'hacienda', 'lodge', 'hunting_lodge', 'lodging_house', 'rooming_house', 'maisonette', 'maisonnette', 'mansion', 'mansion_house', 'manse', 'hall', 'residence', 'ranch_house', 'residence', 'row_house', 'town_house', 'safe_house', 'saltbox', 'sod_house', 'soddy', 'adobe_house', 'solar_house', 'tract_house', 'villa']


Bear in mind: we don't have to stop here.  We can keep drilling down within each of these categories to get an even finer-grained list.

In [147]:
for syn in synlist:
    print(syn.lemmas())

[Lemma('beach_house.n.01.beach_house')]
[Lemma('boarding_house.n.01.boarding_house'), Lemma('boarding_house.n.01.boardinghouse')]
[Lemma('bungalow.n.01.bungalow'), Lemma('bungalow.n.01.cottage')]
[Lemma('cabin.n.02.cabin')]
[Lemma('chalet.n.01.chalet')]
[Lemma('chapterhouse.n.02.chapterhouse'), Lemma('chapterhouse.n.02.fraternity_house'), Lemma('chapterhouse.n.02.frat_house')]
[Lemma('country_house.n.01.country_house')]
[Lemma('detached_house.n.01.detached_house'), Lemma('detached_house.n.01.single_dwelling')]
[Lemma('dollhouse.n.01.dollhouse'), Lemma('dollhouse.n.01.doll's_house')]
[Lemma('duplex_house.n.01.duplex_house'), Lemma('duplex_house.n.01.duplex'), Lemma('duplex_house.n.01.semidetached_house')]
[Lemma('farmhouse.n.01.farmhouse')]
[Lemma('gatehouse.n.01.gatehouse')]
[Lemma('guesthouse.n.01.guesthouse')]
[Lemma('hacienda.n.02.hacienda')]
[Lemma('lodge.n.04.lodge'), Lemma('lodge.n.04.hunting_lodge')]
[Lemma('lodging_house.n.01.lodging_house'), Lemma('lodging_house.n.01.rooming_h

In [148]:
finer_syns = []

# collect the hyponyms of each house type, one level further down
for syn in synlist:
    for h in syn.hyponyms():
        finer_syns.append(h)

print(finer_syns)

[Synset('bed_and_breakfast.n.01'), Synset('log_cabin.n.01'), Synset('chateau.n.01'), Synset('dacha.n.01'), Synset('shooting_lodge.n.01'), Synset('summer_house.n.01'), Synset('villa.n.03'), Synset('villa.n.04'), Synset('lodge.n.03'), Synset('flophouse.n.01'), Synset('manor.n.01'), Synset('palace.n.01'), Synset('stately_home.n.01'), Synset('court.n.09'), Synset('deanery.n.01'), Synset('manse.n.02'), Synset('palace.n.04'), Synset('parsonage.n.01'), Synset('religious_residence.n.01'), Synset('brownstone.n.02'), Synset('terraced_house.n.01')]


In [149]:
new_vocab_finer = []

for syn in finer_syns:
    for subsyn in syn.lemmas():
        new_vocab_finer.append(str(subsyn.name()))

new_vocab_finer

['bed_and_breakfast',
 'bed-and-breakfast',
 'log_cabin',
 'chateau',
 'dacha',
 'shooting_lodge',
 'shooting_box',
 'summer_house',
 'villa',
 'villa',
 'lodge',
 'flophouse',
 'dosshouse',
 'manor',
 'manor_house',
 'palace',
 'castle',
 'stately_home',
 'court',
 'deanery',
 'manse',
 'palace',
 'parsonage',
 'vicarage',
 'rectory',
 'religious_residence',
 'cloister',
 'brownstone',
 'terraced_house']
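The cells above go down two levels by hand. The same idea can be repeated to any depth with a function that calls itself. The sketch below uses a small hand-made dictionary (`toy_hyponyms`, a hypothetical stand-in) rather than WordNet itself, so that it runs without the corpus; with WordNet, a call to `.hyponyms()` would play the role of the dictionary lookup.

```python
# toy hierarchy standing in for WordNet: each term maps to its more specific terms
toy_hyponyms = {
    "house": ["cabin", "mansion", "residence"],
    "cabin": ["log_cabin"],
    "mansion": ["manor", "palace"],
    "residence": ["parsonage"],
}

def all_hyponyms(term, hierarchy):
    """Collect every term below `term`, however many levels down."""
    found = []
    for child in hierarchy.get(term, []):
        found.append(child)
        found.extend(all_hyponyms(child, hierarchy))
    return found

house_terms = all_hyponyms("house", toy_hyponyms)
print(house_terms)
```

Terms with no entry in the dictionary have no children, which is what stops the function from calling itself forever.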

In [150]:
# extend() adds each word individually; append() would nest each list as a single item
controlled_vocab.extend(new_vocab_finer)
controlled_vocab.extend(new_vocab)
controlled_vocab

['garden',
 'room',
 'estate',
 'manor',
 'hedge',
 'residence',
 'park',
 'lane',
 'chair',
 'sofa',
 'settee',
 'bed',
 'bedroom',
 'chaise',
 'table',
 'rug',
 'carpet',
 'candelabra',
 'shed',
 'cottage',
 'fence',
 'turret',
 'castle',
 'palace',
 'hut',
 'dwelling',
 'bed_and_breakfast',
 'bed-and-breakfast',
 'log_cabin',
 'chateau',
 'dacha',
 'shooting_lodge',
 'shooting_box',
 'summer_house',
 'villa',
 'villa',
 'lodge',
 'flophouse',
 'dosshouse',
 'manor',
 'manor_house',
 'palace',
 'castle',
 'stately_home',
 'court',
 'deanery',
 'manse',
 'palace',
 'parsonage',
 'vicarage',
 'rectory',
 'religious_residence',
 'cloister',
 'brownstone',
 'terraced_house',
 'beach_house',
 'boarding_house',
 'boardinghouse',
 'bungalow',
 'cottage',
 'cabin',
 'chalet',
 'chapterhouse',
 'fraternity_house',
 'frat_house',
 'country_house',
 'detached_house',
 'single_dwelling',
 'dollhouse',
 "doll's_house",
 'duplex_house',
 'duplex',
 '

In [151]:
# count the vocabulary matches again, now using the expanded list
controlled_words = []

for w in words:
    if w.lower() in controlled_vocab:
        controlled_words.append(w)

Counter(controlled_words)

Counter({'Park': 1, 'chaise': 1})
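A membership test such as `w.lower() in controlled_vocab` compares each word against the top-level items of the list only, so the way two vocabulary lists are combined affects what can match. The sketch below uses two short made-up lists to show the difference between `append` and `extend`, and then replaces the underscores that WordNet uses in multi-word lemmas with spaces:

```python
base_vocab = ["garden", "room"]
extra_vocab = ["country_house", "cottage"]

nested = list(base_vocab)
nested.append(extra_vocab)      # adds the whole list as ONE item
print("cottage" in nested)      # the string is hidden inside the inner list

flat = list(base_vocab)
flat.extend(extra_vocab)        # adds each string as its own item
print("cottage" in flat)

# WordNet joins multi-word lemmas with underscores; swap them for spaces
phrases = [term.replace("_", " ") for term in flat]
print(phrases)
```

A phrase such as "country house" still cannot match a single token, which is one reason the exercise below turns to bigrams.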

# Exercise 

*To be turned in on Canvas*

1) Expand the "controlled_vocab" variable by looping through its original words, finding their noun hyponyms, and creating a list of lemmas that you can use to search Jane Austen.

2) Next, find the bigrams (two-word phrases) in Jane Austen that contain any of these words.  Sort the phrases by descending frequency, and paste the top twenty in Canvas.

3) Write an interpretive paragraph of at least five sentences making some observations about the built landscape of England at the time of Jane Austen.  Offset phrases and words found in the text with quotation marks.