For citation information, please see the "Source Information" section listed in the associated README file: https://github.com/stephbuon/digital-history/tree/master/hist3368-week4-wordnet-controlled-vocab/

# Week 4 Assignment: Working With a Controlled Vocabulary

In an earlier notebook we saw that the phrases "the world" and "the house" were particularly prevalent in Jane Austen's novels. How might we go about investigating how Austen imagined those realms?  One way might be to use a controlled vocabulary of words semantically linked to the landscape and the house to investigate how Austen talks about the surroundings of her characters.

In this notebook, we'll be working with a 'controlled vocabulary,' which is to say, expert-defined words that help to limit our pursuit of wordcount to words that share a certain semantic valence.  Controlled Vocabularies have been used in digital history to examine the history of words used by Victorian people to describe the way that strangers walked down the street, and to show that novelists in the nineteenth century described the urban landscape with increasing detail.  

First, we'll download some novels by Jane Austen to try our vocabulary on.  Then, we'll talk about how to clean the text using stemming and lemmatization.  

Next, we'll use a controlled vocabulary to limit the count to words that are interesting to us.  Then, we'll expand that controlled vocabulary using the 'hyponym' feature of the WordNet package, which consults with dictionaries of the English language organized by linguists at Princeton.  

Finally, we'll visualize our findings.


## Download some Jane Austen Novels

In [2]:
cd /scratch/group/history/hist_3368-jguldi

/scratch/group/history/hist_3368-jguldi


In [42]:
import nltk, numpy, re, matplotlib



with open('senseandsensibility.txt', 'r') as myfile:
    sas_data = myfile.read().split('\n\n"I suppose you know, ma\'am, that Mr. Ferrars is married"\n\nIt _was_ Edward\n\n"Everything in such respectable condition"\n\n ')[1].split('THE END')[0].strip()

with open('emma.txt', 'r') as myfile:
    emma_data = myfile.read().split('CHAPTER I')[1].split('FINIS')[0].strip()

with open('prideandprejudice.txt', 'r') as myfile:
    pap_data = myfile.read().split('CHAPTER I')[1].split('End of the Project Gutenberg EBook of Pride and Prejudice, by Jane Austen')[0].strip()

# combine into a list
data = [sas_data, emma_data, pap_data]


Remember that in our last notebook we decided that we would apply an extra cleaning step to replace all hyphens in the historic text with spaces, so that a search for "boarding house" wouldn't miss "boarding-house".

In [58]:
for novel in data:
    v = novel.replace('-', ' ')

We can now continue witht he normal cleaning steps

In [62]:
# remove whitespace characters
for i in range(len(data)):
    data[i] = data[i].replace('\n', ' ')

# lowercase and strip punctuation
import re

for i in range(len(data)):
    # data[i] is the current novel
    data[i] = data[i].lower() # force to lowercase
    data[i] = re.sub('[\",.;:?([)\]_*]', '', data[i]) # remove punctuation and special characters with regular expression

cleanaustenwords = []

for novel in data:
    words = novel.split()
    for word in words:
        cleanaustenwords.append(word)
        
cleanaustenwords[:20]

['chapter',
 'i',
 'the',
 'family',
 'of',
 'dashwood',
 'had',
 'long',
 'been',
 'settled',
 'in',
 'sussex',
 'their',
 'estate',
 'was',
 'large',
 'and',
 'their',
 'residence',
 'was']

When working with a controlled vocabulary, we don't need to stopword, so we'll skip that step.

# Controlled Vocabulary

Let's look for what scholars call a "controlled vocabulary" -- a list of words that we know to be meaningful. For right now, let's pretend that we're researching the buildings, landscape, and furniture of nineteenth-century England.  I'm curious about what kinds of spaces are described in Austen, and I'd like to begin by counting them.

In [63]:
controlled_vocab = [
    "garden",
    "room", 
    "estate",
    "manor", 
    "hedge", 
    "residence",
    "park",
    "lane",
    "chair",
    "sofa",
    "settee",
    "bed",
    "bedroom",
    "chaise",
    "table",
    "rug",
    "carpet",
    "candelabra",
    "shed",
    "cottage",
    "fence",
    "turret",
    "castle",
    "palace",
    "hut",
    "dwelling"
]

In [64]:
import pandas as pd

controlled_words = []

words = data[0].split()

for w in words:
    if w in controlled_vocab:
        controlled_words.append(w)

pd.Series.value_counts(controlled_words)

room         97
cottage      56
park         51
bed          25
table        23
estate       19
garden       11
chair         9
residence     7
chaise        6
dwelling      6
shed          3
lane          3
bedroom       1
sofa          1
manor         1
rug           1
dtype: int64

That's okay for "room" -- with 97 hits -- but it doesn't seem like my list of vocabulary was very accurate; most of the words I chose have fewer than ten appearances.  

It also occurs to me that I might not be thinking clearly about all the kinds of furniture, buildings, and other structures that might make up the Georgian landscape.  Fortunately, linguists have compiled many dictionaries that can help us to navigate the semantic universe with greater position.  One of these dictionaries is "Wordnet," the fruit of a long-term research undertaking at Princeton. 

# Expanded Controlled Vocabulary with Wordnet

In an earlier notebook, we saw that we could use the Wordnet package to generate a list of semantically-connected words.

In [86]:
from textblob import Word
from nltk.corpus import wordnet as wn

mysynlist = wn.synset('room.n.01')

# get the hyponyms
hyposynlist = mysynlist.hyponyms()

# get the hyponyms of the hyponyms
finer_syns = [] # create an empty list
 
for syn in hyposynlist: # loop through all the synsets
    myhyponyms = syn.hyponyms()
    for h in myhyponyms:
        finer_syns.append(h)
  
# get the lemmas for everything so far
housinglemmas = [] # create an empty list
for syn in hyposynlist:
    for l in syn.lemmas():
        if l.name() not in housinglemmas: # check for uniqueness
            housinglemmas.append(l.name())

for syn in finer_syns:
    for l in syn.lemmas():
         if l.name() not in housinglemmas: # check for uniqueness
            housinglemmas.append(l.name())

# remove ambiguous words
housinglemmas2 = []

for word in housinglemmas:
    if word not in ['hall', 'court']:
        housinglemmas2.append(word)

housinglemmas2[:15]

# replace hyphens and underscores with spaces
cleanwordnetwords = []

for word in housinglemmas2:
    v = word.replace('-', ' ')
    v = v.replace('_', ' ') 
    cleanwordnetwords.append(str(v))
    
cleanwordnetwords

['anechoic chamber',
 'anteroom',
 'antechamber',
 'entrance hall',
 'foyer',
 'lobby',
 'vestibule',
 'back room',
 'ballroom',
 'dance hall',
 'dance palace',
 'barroom',
 'bar',
 'saloon',
 'ginmill',
 'taproom',
 'bathroom',
 'bath',
 'bedroom',
 'sleeping room',
 'sleeping accommodation',
 'chamber',
 'bedchamber',
 'belfry',
 'billiard room',
 'billiard saloon',
 'billiard parlor',
 'billiard parlour',
 'billiard hall',
 'boardroom',
 'council chamber',
 'cardroom',
 'cell',
 'cubicle',
 'jail cell',
 'prison cell',
 'checkroom',
 'left luggage office',
 'classroom',
 'schoolroom',
 'clean room',
 'white room',
 'cloakroom',
 'coatroom',
 'closet',
 'clubroom',
 'compartment',
 'conference room',
 'control room',
 'courtroom',
 'cubby',
 'cubbyhole',
 'snuggery',
 'snug',
 'cutting room',
 'darkroom',
 'den',
 'dinette',
 'dining room',
 'dining room',
 'door',
 'dressing room',
 'durbar',
 'engineering',
 'engine room',
 'floor',
 'trading floor',
 'furnace room',
 'gallery',
 '

Not all of these words will be useful for mining Jane Austen, who is notorious for having few frat houses in her books. But she might have some cottages or castles, and this list is much longer than mine!

## Find the controlled wordnet vocabulary in Jane Austen

Next, let's run a "for" loop similar to those we've seen before to search for the words in *cleanwordnetwords* that also appear in Jane Austen novels.  As you'll recall, we have a master list of Austen words called "cleanaustenwords."

First, let's demonstrate how **not** to do it, using the operator "in."

In [87]:
matchedwords = []

for w in cleanaustenwords:
    for v in cleanwordnetwords: 
        if w in v:
            matchedwords.append(v)

pd.Series.value_counts(matchedwords)[:30]

operating theater          12344
operating theatre          12341
toilet facility            10186
public lavatory             9611
withdrawing room            9064
dormitory room              8782
dormitory                   8685
automobile trunk            8396
torture chamber             8317
living room                 8282
dining room                 8222
waiting area                8208
discotheque                 8110
left luggage office         8064
waiting room                8058
sleeping accommodation      7739
elevator car                7586
lavatory                    7586
narthex                     7480
comfort station             7435
exhibition hall             7319
operating room              7115
laminar flow clean room     6891
exhibition area             6889
storage room                6854
dining hall                 6852
storage locker              6770
scriptorium                 6588
trading floor               6533
stowage                     6397
dtype: int

Hmm, I'm not sure that's right.  There shouldn't be so many solar houses or log cabins in Jane Austen.  

What might have gone wrong?  The problem is 'in.'  By using this code:

        if w in v

-- we counted every word in Jane Austen that has one of the words in cleanwordnetwords inside it.  Thus the word "bed" appears 57 times in Jane Austen, and the phrase bed-and-breakfast appears in cleanwordnetwords.  The computer counted 57 bed-and-breakfasts in Jane Austen. That's just wrong!

You can only catch errors of this kind if you're thinking (a) about the code and (b) about the history.

How can we tweak the code above to be more accurate?  

### Accurately searching for matches of single words with '==' 

The answer involves finding a perfect match. 

Last time, we saw how to search for a match using re.compile() and .match.(). That's one way you could do it.

This time, we'll use another formulation that does the same thing. The formula "if a==b".  This formula latter searches for cases where a is an exact match for b.

In [88]:
matchedwords = []

for w in cleanaustenwords:
    for v in cleanwordnetwords: 
        if w == v: # notice what I changed in this line
            matchedwords.append(v)
matchedwords
pd.Series.value_counts(matchedwords)

can            212
well           210
john           148
door            38
head            36
keep            32
hold            13
parlour         11
convenience      5
study            3
snug             3
library          3
kitchen          2
bath             2
chamber          2
lobby            2
bedroom          1
floor            1
vestibule        1
closet           1
toilet           1
dtype: int64

Some of these words are false positives.  The words "can," "well," "john," "head," "keep," and so forth have too many meanings to tell us about Jane Austen.  I'm also wary of 'toilet,' which in nineteenth-century English probably refers to the practice of grooming rather than to a room in the house.

Let's stopword our list based on our own understanding of which words are meaningful, i.e., unambiguously about our research subject, Jane Austen's world.

In [121]:
ambiguouswords = ['can', 'head', 'well', 'toilet', 'john', 'keep', 'hold', 'study', 'convenience', 'snug', 'floor']

matchedwords = []

for w in cleanaustenwords:
    for v in cleanwordnetwords: 
        if w == v: 
            if v not in ambiguouswords:
                matchedwords.append(v)
matchedwords
pd.Series.value_counts(matchedwords)

door         38
parlour      11
library       3
bath          2
kitchen       2
lobby         2
chamber       2
closet        1
bedroom       1
vestibule     1
dtype: int64

That's much better. We can begin to interpret it.

 * In Jane Austen's world, the 'doors' separating inside from outside have particular meaning. We might speculate that doors suggest privacy, but we'd need in-text mentions to understand why she invokes them so often.
 * The 'parlour' is also important as a meeting place where the family received company.  We'd also want to read a few mentions and perhaps some plot summaries of her novels to understand who is being received in parlours and why there are so many of them.
 * More humble spaces like the bath, closet, and bedroom are mentioned with extreme infrequency.



## Bigrams


We still aren't getting all the information.

The variable *cleanaustenwords* contains strings that are one-word long. 

We're matching it with *cleanwordnetwords* -- which includes such two-word phrases such as "terraced house." 

Those two-word phrases aren't getting accurately matched. 

We need bigrams.






Remember that in an earlier exercise we learned how to find all the bigrams in Jane Austen.



In [100]:
from textblob import TextBlob

bigrams = TextBlob(data[0]).ngrams(n=2)

austenbigramlist = [] # create an empty list which we will fill in with the following loop:

for bigram in bigrams: # move through each line of the *bigrams* list
    bigram2 = bigram[0] + ' ' + bigram[1] # call the first word, a space, and the second word into a new string
    austenbigramlist.append(bigram2) # save the string 
austenbigramlist[:15]

['chapter i',
 'i the',
 'the family',
 'family of',
 'of dashwood',
 'dashwood had',
 'had long',
 'long been',
 'been settled',
 'settled in',
 'in sussex',
 'sussex their',
 'their estate',
 'estate was',
 'was large']

Let's write a loop to find just the multiword phrases in our wordnet words.  We'll search for the presence of a space (' ').  Then we'll save those words with spaces as the new variable, *cleanwordnetbigrams*.

Just to be sure we're getting everything, we can also add back a version of word that has the space removed, so that our search will look for "bedroom" as well as "bed room." 

In [104]:
cleanwordnetbigrams = []
for vocab in cleanwordnetwords:
    if " " in vocab:
        cleanwordnetbigrams.append(vocab)
        cleanwordnetbigrams.append(vocab.replace(' ', ''))

cleanwordnetbigrams

['anechoic chamber',
 'anechoicchamber',
 'entrance hall',
 'entrancehall',
 'back room',
 'backroom',
 'dance hall',
 'dancehall',
 'dance palace',
 'dancepalace',
 'sleeping room',
 'sleepingroom',
 'sleeping accommodation',
 'sleepingaccommodation',
 'billiard room',
 'billiardroom',
 'billiard saloon',
 'billiardsaloon',
 'billiard parlor',
 'billiardparlor',
 'billiard parlour',
 'billiardparlour',
 'billiard hall',
 'billiardhall',
 'council chamber',
 'councilchamber',
 'jail cell',
 'jailcell',
 'prison cell',
 'prisoncell',
 'left luggage office',
 'leftluggageoffice',
 'clean room',
 'cleanroom',
 'white room',
 'whiteroom',
 'conference room',
 'conferenceroom',
 'control room',
 'controlroom',
 'cutting room',
 'cuttingroom',
 'dining room',
 'diningroom',
 'dining room',
 'diningroom',
 'dressing room',
 'dressingroom',
 'engine room',
 'engineroom',
 'trading floor',
 'tradingfloor',
 'furnace room',
 'furnaceroom',
 'art gallery',
 'artgallery',
 'picture gallery',
 'pic

Here's the loop to match the bigrams in austen -- from *austenbigramlist* -- with the multi-word phrases in cleanwordnetwords -- from _cleanwordnetbigrams_.

In [105]:
matchedbigrams = []

for w in austenbigramlist:
    for v in cleanwordnetbigrams: 
        if w == v: # notice what I changed in this line
            matchedbigrams.append(v)

pd.Series.value_counts(matchedbigrams)

drawing room     5
dining room      4
sitting room     3
billiard room    1
walk in          1
dtype: int64

Again we can begin an interpretation from these findings.

  * Mainly, Austen invokes public spaces of great estates where the sexes would mingle -- the drawing room, dining room, and sitting room are places where both women and men would convene.  
  * The billiard room would be a place mainly for men.
  * Notably, we still see far less of the places reserved for women -- like bed rooms -- or for servants, like cottages or kitchens.  

Our controlled vocabulary search has successfully illuminated the world of Jane Austen.

# Assignment 

*To be turned in on Canvas*

### 1) In this exercise, you will use Wordnet to compile a list of terms that can help us to explore Jane Austen's word.  

Brainstorm a list of possible places where Austen's characters might go. The fireside? The dance hall? A carriage? Town? Do they see rivers? cottages? thatched huts? Do they look out of windows? Or is their world mostly one of dresses, wigs, gowns, and other ornaments? Perhaps they go to church? Perhaps family relationships are more important than places, and cousins, uncles, and daughters are really what's important?

Use your thoughts as the basis for expanding your controlled vocabulary.  In this exercise, we only took the hyponyms from one synset -- room.n.01.  But you could take the hyponyms from another synset.  Alternatively, you could look for synsets that contain the word 'furniture' or 'garden.' You can use the formula Word('garden').synsets to call up the synsets for those words, and you can plug those synsets into our code to find multiple-word formulae.

Once you have a synset that you think is meaningful, run it through the code that follows to produce a new list of matches in Jane Austen.  

What you choose is up to you. Play with Wordnet and the code until you are able to expand from a few queries to a list of words that actually returns meaningful results when matched against the text.

This assignment requires a process of exploration and trial and error.

### 2) Make a data visualization of your findings.

Use the bar plot format to graph the most popular places in Jane Austen according to your research.

You will want to combine the matched bigrams and matched individual words into one dataset before graphing.

You will want to make sure to eliminate ambiguous words and redundancies from your results.

As always, make sure it is well-labled and consistent.

Embed the data visualization into a Word Document.

In [147]:
for synset in Word('park').synsets:
    for lemma in synset.lemmas():
        print(lemma.name())

park
parkland
park
commons
common
green
ballpark
park
Park
Mungo_Park
parking_lot
car_park
park
parking_area
park
park
park


In [150]:
for synset in Word('park').synsets:
    for s in synset.hyponyms():
        for lemma in s.lemmas():
            print(lemma.name())

national_park
safari_park
amusement_park
funfair
pleasure_ground
village_green
used-car_lot
angle-park
double-park
parallel-park


### 2) Write an interpretive paragraph of at least five sentences making some observations about the built landscape of England at the time of Jane Austen. 

Where appropriate, refer to the data visualization you have made as a source of evidence. Talk about as many words in the visualization as you can.

Where appropriate, use the text of Jane Austen's novels to elucidate the meaning of words whose implications are unclear to you.  Look up in-text mentions of the places you found. 

  * Sense and Sensibility: https://www.gutenberg.org/files/161/161-h/161-h.htm
  * Pride and Prejudice: https://www.gutenberg.org/ebooks/1342
  * Emma: https://www.gutenberg.org/files/158/158-h/158-h.htm 
  
Make a well-supported arguments.

Offset phrases and words found in the text with quotation marks. Use footnotes to tell us where each direct quotation is from.


### Help where help is needed

If you're finding yourself confused about the code and how to follow directions at this point, bear in mind that we're moving very quickly through the introduction to Python. You might need to slow down and revisit some of the "optional" notebooks that we mentioned in Weeks 1-2.  Here they are again:

- lists : https://github.com/laurenfklein/emory-qtm340/blob/master/notebooks/lists.ipynb
- for loops : https://problemsolvingwithpython.com/09-Loops/09.01-For-Loops/
- expressions and strings :  https://github.com/laurenfklein/emory-qtm340/blob/master/notebooks/expressions-and-strings.ipynb
- dictionaries, sets, tuples: https://github.com/laurenfklein/emory-qtm340/blob/master/notebooks/dictionaries-sets-tuples.ipynb
- counting things: https://github.com/laurenfklein/emory-qtm340/blob/master/notebooks/counting.ipynb

Remember that SMU expects you to be spending around 6 hours every week on your homework for this class.  Don't be afraid to keep tweaking the code until it works -- or reaching out on Slack if you need encouragement from others.  

Also, please bear in mind that everyone who learns how to code ultimately does so through a lot of trial and error.  Try typing in code and running it. When you run into trouble, you can google your problems and find stack overflow results or blog entries that match your problem and suggest solutions.  The more you try, the faster you will master code.  

Don't give up!  Keep trying things until you feel like you're getting it! 

