For citation information, please see the "Source Information" section listed in the associated README file: https://github.com/stephbuon/digital-history/tree/master/hist3368-week3-ngrams-lemmatization-gender

# Week 3 Mini Notebook: Searching Bigrams

Very often in this work we will be interested not only in individual words but in multiple word phrases.  Sometimes multiple word phrases are repeated often, and they hold potent meaning:
    
   * true love
   * dining room
   * law and order

Counting two-word phrases can also illuminate hidden bias in large groups of documents. For example, one study found that plot summaries of twentieth-century American movies tended to assign different verbs to the pronouns "he" and "she":

   * he beats
   * he instructs
   * he challenges
   * he orders
   * he owns
   
 
   * she cries
   * she rejects
   * she accepts
   * she forgives
   * she begs
   
see http://varianceexplained.org/r/tidytext-gender-plots/

Clearly, these two-word phrases suggest some bias in the ways that women and men are portrayed on the American screen.

The practice of finding multiple-word phrases is called finding **"ngrams"**.  A two-word phrase is also called a **"bigram"**.  A three-word phrase is called a **"trigram"**, and so on.  

In general, repeated many-word phrases often hold telling clues about the values of a text.

In this notebook, we will learn how to search for multiple-word phrases.   We will use the **TextBlob** software package and the command **.ngrams()** to find these phrases.

## Download some Jane Austen Novels

These steps repeat the instructions for downloading and cleaning text from earlier. 

In [3]:
import nltk, numpy, re, matplotlib# , num2words

In [4]:
cd /scratch/group/history/hist_3368-jguldi

/scratch/group/history/hist_3368-jguldi


In [5]:
#download some data

with open('senseandsensibility.txt', 'r') as myfile:
    sas_data = myfile.read().split('\n\n"I suppose you know, ma\'am, that Mr. Ferrars is married"\n\nIt _was_ Edward\n\n"Everything in such respectable condition"\n\n ')[1].split('THE END')[0].strip()

with open('emma.txt', 'r') as myfile:
    emma_data = myfile.read().split('CHAPTER I')[1].split('FINIS')[0].strip()

with open('prideandprejudice.txt', 'r') as myfile:
    pap_data = myfile.read().split('CHAPTER I')[1].split('End of the Project Gutenberg EBook of Pride and Prejudice, by Jane Austen')[0].strip()

# combine into a list
data = [sas_data, emma_data, pap_data]

# remove whitespace characters
for i in range(len(data)):
    data[i] = data[i].replace('\n', ' ')

# split into words
import pandas

for novel in data:
    words = novel.split()

# lowercase and strip punctuation
import re

for i in range(len(data)):
    # data[i] is the current novel
    data[i] = data[i].lower() # force to lowercase
    data[i] = re.sub('[\",.;:?([)\]_*]', '', data[i]) # remove punctuation and special characters with regular expression



We now have a list of novels called **data**, which has been stripped of punctuation & lowercased
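To see what the cleaning step does and does not remove, here is a minimal sketch that applies the same lowercasing and the same character class to one sentence. The sample sentence is made up for illustration; it is not a quotation from the novels. The pattern is written as a raw string, which is assumed to be equivalent to the one in the cell above.

```python
import re

# Hypothetical sample sentence, chosen only to exercise the pattern.
sample = '"Oh! dear," said Mrs. Jennings; "it is _Edward\'s_ bed-room."'

cleaned = sample.lower()                            # force to lowercase
cleaned = re.sub(r'[",.;:?(\[)\]_*]', '', cleaned)  # same characters as the cell above

print(cleaned)
```

Note that exclamation marks, apostrophes, and hyphens are not in the character class, so they survive the clean-up. That is worth remembering when tokens such as "'s" turn up in the counts later on.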

In [6]:
data[0][:500].strip() # .strip() just removes the extra whitespace from the title page and allows the text to display properly

'chapter i   the family of dashwood had long been settled in sussex their estate was large and their residence was at norland park in the centre of their property where for many generations they had lived in so respectable a manner as to engage the general good opinion of their surrounding acquaintance the late owner of this estate was a single man who lived to a very advanced age and who for many years of his life had a constant companion and housekeeper in his s'

***IMPORTANTLY*** please note that we did not stopword the data this time. 

### The Order of Cleaning Operations Matters

The order of cleaning operations matters when grammatical structures -- like those captured by bigrams -- are in play.

It might not matter whether you stopword or strip punctuation first. But as soon as you introduce grammatical structures -- like two-word phrases -- you need to work with sentences that faithfully reproduce the flow of the original text.  That is, you need to work with sentences that haven't yet been stopworded. 

When we look for grammatical relationships -- like verbs that follow nouns, which we will capture in today's analysis, or meaningful two-word phrases -- **stopwords are important** -- at least for creating the original index of two-word phrases.

Stopwords are part of the order of the sentence. 
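A small sketch can make the point concrete. It builds bigrams from one sentence in two different orders. The stopword set here is a tiny stand-in invented for the example (the notebook's real list comes from NLTK), and `make_bigrams` is a hypothetical helper, not part of TextBlob.

```python
# Assumed toy stopword list; the notebook's real list comes from NLTK.
toy_stopwords = {"the", "of", "had", "been", "in"}

def make_bigrams(words):
    """Pair each word with the word that follows it."""
    return [a + " " + b for a, b in zip(words, words[1:])]

words = "the family of dashwood had long been settled in sussex".split()

# Bigrams first: every pair really occurs in the sentence.
bigrams_first = make_bigrams(words)

# Stopwords first: words that were never neighbors get glued together.
stopworded_first = make_bigrams([w for w in words if w not in toy_stopwords])

print(bigrams_first[:3])
print(stopworded_first)
```

With stopwords removed first, the sketch produces "family dashwood" and "settled sussex," phrases that never appear in the sentence. That is the distortion the ordering rule is meant to avoid.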


*You should get into the habit of asking yourself which cleaning steps have been executed, in which order, and what that means for the results.*

***In text mining, the analyst has to choose which cleaning steps to use for each analysis in question.***  

As this course moves forward, we'll discuss less and less the steps for cleaning that we have chosen and the order.  We'll leave it to you to figure out why certain steps go in a certain order. 


# Counting Words and N-Grams

### N-Grams

Sometimes we want to look for multi-word phrases instead of individual words.  For example, if we're researching the living spaces of Jane Austen's England, we definitely want to know whether she refers to "dining rooms" or "bed-rooms" (which our punctuation clean-up might have turned into separate words, depending on what we did).

We will use the software package **textblob** and the command TextBlob().  **TextBlob()** takes an object which is a string of text.

   TextBlob(novel)

We will use the command **.ngrams()**, which takes the argument "n = _," where _ is set as the number of words in a phrase.

Here's how we tell TextBlob to look at the first 100 characters of Sense and Sensibility for 2-word phrases.

In [8]:
from textblob import TextBlob

bigrams = TextBlob(data[0][:100]).ngrams(n=2)
bigrams[:10]

[WordList(['chapter', 'i']),
 WordList(['i', 'the']),
 WordList(['the', 'family']),
 WordList(['family', 'of']),
 WordList(['of', 'dashwood']),
 WordList(['dashwood', 'had']),
 WordList(['had', 'long']),
 WordList(['long', 'been']),
 WordList(['been', 'settled']),
 WordList(['settled', 'in'])]

Notice what's happening. If you read through the first word only of the above pairs, it reads just like the first words of the chapter: "Chapter i: The Family of Dashwood had long been settled in Sussex."  But the computer has identified all the two-word pairs in that sentence: "the family," "family of," "of Dashwood."  

That is *exactly* what bigrams are.
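The sliding window described above can be sketched in plain Python, without TextBlob. The function name `sliding_ngrams` is made up for this illustration. TextBlob does its own tokenizing before it builds n-grams, so on real text its results may differ from a simple split on spaces.

```python
def sliding_ngrams(words, n):
    """Return every run of n consecutive words as a list."""
    return [words[i:i + n] for i in range(len(words) - n + 1)]

words = "chapter i the family of dashwood".split()

print(sliding_ngrams(words, 2))
print(sliding_ngrams(words, 3))
```

A text of six words yields five bigrams and four trigrams: the window moves one word at a time, so each word (except at the edges) appears in more than one n-gram.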

Here's how we look for three-word phrases.

In [13]:
trigrams = TextBlob(data[0][:100]).ngrams(n=3)
trigrams[:5]

[WordList(['chapter', 'i', 'the']),
 WordList(['i', 'the', 'family']),
 WordList(['the', 'family', 'of']),
 WordList(['family', 'of', 'dashwood']),
 WordList(['of', 'dashwood', 'had'])]

Here's how we look for four-word phrases.

In [62]:
fourgrams = TextBlob(data[0][:100]).ngrams(n=4)
fourgrams[:5]

[WordList(['chapter', 'i', 'the', 'family']),
 WordList(['i', 'the', 'family', 'of']),
 WordList(['the', 'family', 'of', 'dashwood']),
 WordList(['family', 'of', 'dashwood', 'had']),
 WordList(['of', 'dashwood', 'had', 'long'])]

How would you look for five-word phrases?

### Converting WordList N-grams to a Normal List

When we call TextBlob().ngrams(), the bigrams that result are in the form of a list, each member of which is a list in TextBlob's own **"WordList"** format.  

WordList is a class defined by the TextBlob package. It functions much like a **normal list**: each WordList is a list of words, so the result behaves like a list of lists.

Often, however, we will want bigrams to appear as a string: 'the family' rather than ['the', 'family']. It's good to be able to convert a list of WordLists of individual words into a simple list of strings.  

We can convert them into a **normal list** of **strings** by:
   * calling each part of the list using square brackets (**bigram[0]**, **bigram[1]**)
   * pasting them together using '+' with a space (' ') in the middle 

Here's a loop to clean up just the bigrams generated from *Sense and Sensibility* above -- the WordList data *bigrams*:

In [64]:
bigramlist = [] # create an empty list which we will fill in with the following loop:

for bigram in bigrams: # move through each line of the *bigrams* list
    bigram2 = bigram[0] + ' ' + bigram[1] # call the first word, a space, and the second word into a new string
    bigramlist.append(bigram2) # save the string 
bigramlist[:5]

['chapter i', 'i the', 'the family', 'family of', 'of dashwood']
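The '+' approach works for bigrams because we know there are exactly two words. A more general sketch uses the string method **.join()**, which glues together however many words there are. Plain Python lists stand in for WordLists here, an assumption made so that the example runs without TextBlob; a WordList can be looped over in the same way.

```python
# Plain lists standing in for TextBlob WordLists (an assumption for this sketch).
trigrams = [["chapter", "i", "the"],
            ["i", "the", "family"],
            ["the", "family", "of"]]

# " ".join(...) puts a single space between every word in the n-gram.
trigram_strings = [" ".join(trigram) for trigram in trigrams]

print(trigram_strings)
```

The same line works unchanged for bigrams, four-grams, or any other value of n.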

### Generating Bigrams and Cleaning Them At the Same Time

Now, let's try generating bigrams for all the novels, and cleaning them, using a nested loop.  

In [65]:
bigramlist = [] # make an empty list

for novel in data: # cycle through each novel
    bigrams = TextBlob(novel).ngrams(n=2) # get the bigrams from the novel
    for bigram in bigrams: # for each bigram in the WordList bigram list
        bigram2 = bigram[0] + ' ' + bigram[1] # take the first word, add a space, and then the second word, making a new string
        bigramlist.append(bigram2) # save the string to the list, *bigramlist*

bigramlist[:5]

['chapter i', 'i the', 'the family', 'family of', 'of dashwood']

We have done the same thing as we did above, but now *bigramlist* contains all the data for all three novels. 

We've also consolidated the code from the previous two sections into one block of code.

### Count the clean bigrams

Now that we have all the bigrams in one list, we can also count the overall top bigrams.

Let's count the most common bigrams with **value_counts()**:

In [66]:
import pandas as pd
bigramcounts = pd.Series(bigramlist).value_counts() # turn the list into a Series, then count each distinct bigram
bigramcounts[:10]

to be           448
of the          444
in the          369
of her          291
it was          290
to the          244
mrs jennings    237
to her          234
i am            232
she was         213
dtype: int64
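For readers curious about what the count involves, the same kind of tally can be sketched with **Counter** from Python's standard library. This is an alternative shown for comparison, not the method the notebook uses, and the toy list is invented for the example.

```python
from collections import Counter

# Invented toy data; the notebook's bigramlist holds many thousands of entries.
toy_bigrams = ["to be", "of the", "to be", "mrs jennings", "of the", "to be"]

counts = Counter(toy_bigrams)       # tally how often each distinct string occurs
top_two = counts.most_common(2)     # the two most frequent, highest first

print(top_two)
```

Both approaches produce a frequency table sorted from most to least common; pandas is convenient later on when the counts need to be filtered or plotted.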

We did it!  Also, our results are boring.  Now is a good time to apply stopwords.  


## Stopwording a bigram list

Above, we told you that it was important not to stopword at first when you begin generating n-grams, because we want to preserve grammatical structure. 

However, now is a good time to apply stopwords to find more meaningful results.

*Why can we apply stopwords now?* Because we have already done the n-gram analysis, and all of our n-grams are grammatically meaningful.

*As a general rule, you must apply grammatical analysis -- like counting n-grams -- before removing words*

### How to stopword an individual bigram

Stopwording a bigram list is a little more tricky than stopwording a single word list, because each bigram is composed of two words.

In [23]:
from nltk.corpus import stopwords
stopwords = stopwords.words('english')

"to" in stopwords

True

In [24]:
"be" in stopwords

True

In [25]:
"to be" in stopwords

False

However, if we break up "to be" into two words, stopwording works again.

In [26]:
"to be".split()

['to', 'be']

In [27]:
for word in  "to be".split():
    print(word in stopwords)

True
True


Another way of condensing that information is to tweak the for loop above to the formula

     item in list1 for item in list2 

-- which means exactly the same as the loop above.  Sometimes in Python we will want to condense for-loops to *item in list for item in list* structures.  It's not terribly important that you're able to generate this grammar, but you should embrace the general notion that the item-for-item grammar can be translated into a for-loop.

The *item for item* grammar is useful because we can use it alongside the built-in functions "all" and "any."  

   * We can use "all" to see if every test is true (that is, both 'to' and 'be' are in stopwords)
   * We can use "any" to see whether *either* 'to' or 'be' is a stopword.  
   
As we shall see below, these two tests will produce subtle but important differences in the data we generate.
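To show how the condensed grammar translates back into a for-loop, here is a sketch that spells out "all" and "any" by hand. The function names and the three-word stopword set are made up for the illustration; the notebook's real list comes from NLTK.

```python
toy_stopwords = {"to", "be", "the"}   # assumed stand-in for the NLTK list

def all_are_stopwords(phrase):
    """Long form of: all(word in toy_stopwords for word in phrase.split())"""
    for word in phrase.split():
        if word not in toy_stopwords:
            return False      # one non-stopword is enough to fail "all"
    return True

def any_is_stopword(phrase):
    """Long form of: any(word in toy_stopwords for word in phrase.split())"""
    for word in phrase.split():
        if word in toy_stopwords:
            return True       # one stopword is enough to satisfy "any"
    return False

for phrase in ["to be", "the girl", "mrs jennings"]:
    print(phrase, all_are_stopwords(phrase), any_is_stopword(phrase))
```

The one-line versions and the loops give the same answers; the one-line versions are simply shorter to write inside an 'if' statement.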

In [28]:
all(word in stopwords for word in "to be".split())

True

In [29]:
any(word in stopwords for word in "to be".split())

True

In [67]:
any(word in stopwords for word in "the girl".split())

True

In [68]:
all(word in stopwords for word in "the girl".split())

False

In [69]:
any(word in stopwords for word in "mrs jennings".split())

False

In [70]:
all(word in stopwords for word in "mrs jennings".split())

False

### Stopwording a bigram list

Now that we understand how to stopword an individual bigram, we can apply it to an entire list of bigrams.

We can write a for loop to apply stopwords to the bigram list. Notice that below we are using the "any" command, which will retain all bigrams that include no stopwords or just one stopword, while filtering out bigrams that are composed of two stopwords.

There is no one rule of practice about how to stopword bigram lists.  But different filtering practices get different results.  Your responsibility as an analyst is to think critically about your results and to try on different approaches to explore the data and expand your interpretations.

In [77]:
anycleanbigramlist = []
    
for bigram in bigramlist: # cycle through each bigram in the bigram list
    bigram2 = bigram.split() # split the bigram into two words
    if any(word not in stopwords for word in bigram2): # if at least one word in the bigram is not a stopword, then:
        bigram3 = bigram2[0] + ' ' + bigram2[1] # glue the split words back into a phrase with a space in the middle
        anycleanbigramlist.append(bigram3) # save the phrase to the list, *anycleanbigramlist*

pd.Series(anycleanbigramlist).value_counts()[:30] # show just the top 30

mrs jennings       237
could not          167
her sister         145
colonel brandon    132
mrs dashwood       121
her mother         118
sir john           112
she could          108
to see             101
would not          101
would be           100
lady middleton      96
the world           92
the house           88
must be             86
would have          86
every thing         81
it would            77
you know            77
could be            77
am sure             77
mrs ferrars         76
to make             76
my dear             75
marianne 's         74
miss dashwood       70
so much             68
your sister         67
i shall             65
and elinor          65
dtype: int64

#### Interpreting the results 

In this list of bigrams, each of which includes at most one stopword, we see the importance of female relationships in Austen: "her sister," "her mother," and "your sister" are among the most common two-word phrases.  We also see her invoking female possibility -- "she could" -- on a regular basis. 

It is likely female characters who declare, "I shall," although in an essay we would have to check our assertion by cross-referencing the 65 instances of "I shall" with in-text mentions to figure out if women are really saying "I shall" most frequently.

It's interesting that Austen uses phrases like "the world" and "the house" to define the sphere of action in her novels. I don't have much to say about those phrases on their own, but they might lead us to ask about how Austen talks about houses and worlds -- a problem we will pursue in a later problem set.  

It's also interesting that she uses phrases such as "she could" and "it would" or "could be" to talk about worlds that *might* happen.  Again, in an essay we'd want to check out multiple examples of how those phrases were used in the primary source to make sure that our analysis is based on fact, not speculation.

#### Investigating the data by changing the search parameters.

Question: how would the results of this table look different if you used "all" instead of "any" in the loop above?  Try swapping them out and see.

In [39]:
allcleanbigramlist = []
    
for bigram in bigramlist: # cycle through each bigram in the bigram list
    bigram2 = bigram.split() # split the bigram into two words
    if all(word not in stopwords for word in bigram2): # if all the words in the bigram are non-stopwords, then:
        bigram3 = bigram2[0] + ' ' + bigram2[1] # glue the split words back into a phrase with a space in the middle
        allcleanbigramlist.append(bigram3) # save the phrase to the list, *allcleanbigramlist*

pd.Series(allcleanbigramlist).value_counts()[:30] # show just the top 30

mrs jennings       237
colonel brandon    132
mrs dashwood       121
sir john           112
lady middleton      96
every thing         81
mrs ferrars         76
marianne 's         74
miss dashwood       70
elinor 's           62
said elinor         61
sister 's           54
mother 's           46
edward 's           43
every body          40
mr willoughby       39
mrs palmer          38
john dashwood       37
mr palmer           37
jennings 's         36
dare say            36
willoughby 's       35
mr ferrars          31
every day           30
said mrs            30
thousand pounds     30
said marianne       29
” “                 29
miss steeles        28
miss steele         28
dtype: int64

#### Interpreting the results of a slightly different iteration

In the list of bigrams that include no stopwords at all, we no longer see the female relationships. The top two-word phrases are primarily characters -- Mrs. Jennings, Colonel Brandon, Mrs. Dashwood, Sir John, Lady Middleton -- who are referred to most frequently in the novels.  

We see evidence that female characters talking is an important feature of Austen's novels in phrases such as "said Elinor" or "said Marianne." In an interpretive essay, we might want to look up Elinor and Marianne as characters -- perhaps just via a Wikipedia summary -- to try to understand why their speech is so central to the plot.

Some phrases are more important than others.  "Every thing," "every day," and "every body" are more difficult to interpret.  They may indicate that Austen's characters routinely generalize about an unchanging world, where everyone is expected to act the same. Or they may just be features of ordinary speech. We would want to look at in-text mentions to be sure.

We see at least one indication that money is important.  The phrase "thousand pounds" comes up 30 times. In Austen's novels, the economic dependence of women on dowries and husbands is a key plot point. Are there other top phrases that suggest how central money is to the drama?  Try expanding the number of phrases shown from 30 to 100 and see what you think (HINT: you may have to use a for-loop to print the 100 top phrases).

**Summing Up**
*It is important for you as an analyst to be able to talk about how slightly different approaches to the data produce slightly different interpretative results.  Your ability to highlight how you manipulated the data, and how each new manipulation of the data produces a new interpretation, is one of the most important skills you can learn in text mining.*
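As a recap, the two filtering choices can be compared side by side on a toy list. The stopword set and the four bigrams below are invented for the illustration; only the two 'if' tests are taken from the loops above.

```python
toy_stopwords = {"to", "be", "her", "the"}   # assumed stand-in for the NLTK list
toy_bigrams = ["to be", "her sister", "mrs jennings", "the house"]

# "any" filter: keep a bigram if at least one of its words is not a stopword.
keep_any = [b for b in toy_bigrams
            if any(word not in toy_stopwords for word in b.split())]

# "all" filter: keep a bigram only if none of its words is a stopword.
keep_all = [b for b in toy_bigrams
            if all(word not in toy_stopwords for word in b.split())]

print(keep_any)
print(keep_all)
```

The "any" filter drops only "to be," while the "all" filter also drops "her sister" and "the house." That is why the relationship phrases disappear from the second table.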

### Filtering bigrams for a common word

Often we will want to look not at the entire set of bigrams but at bigrams that contain words that are meaningful for a certain kind of analysis, for instance verbs that follow 'she' and verbs that follow 'he.'

What if we only want the bigrams that include the word "she"?

Here is a naive way to do it.

In [78]:
she_bigrams = []

for bigram in bigramlist:
    if "she" in bigram: # naive test: true if the letters 'she' appear anywhere in the bigram
        she_bigrams.append(bigram)
        
pd.Series(she_bigrams).value_counts()[:10]

she was      213
she had      193
that she     122
as she       116
she could    108
and she       96
she would     64
said she      47
she is        46
which she     45
dtype: int64

Let's try it for 'he':

In [79]:
he_bigrams = []

for bigram in bigramlist:
    if "he" in bigram: # naive test: true if the letters 'he' appear anywhere in the bigram
        he_bigrams.append(bigram)
        
pd.Series(he_bigrams).value_counts()[:10]

of the     444
in the     369
of her     291
to the     244
to her     234
she was    213
she had    193
on the     161
at the     160
and the    160
dtype: int64

Unfortunately, the above code won't work for 'he,' because it will pick up other words that contain 'he,' including:
   * 'she' 
   * 'mother' 
   * 'heart' 
   * 'her'
   * 'the'
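The difference between "contains the letters 'he'" and "contains the word 'he'" can be seen in a short sketch. The five bigrams are invented for the example. Splitting the bigram into words is one simple way to compare whole words; it is shown here only to expose the problem, and the regular-expression approach in the next section is the one the notebook goes on to use.

```python
samples = ["she was", "of the", "her mother", "he said", "said he"]

substring_hits = [b for b in samples if "he" in b]            # naive test used above
whole_word_hits = [b for b in samples if "he" in b.split()]   # compares whole words only

print(substring_hits)
print(whole_word_hits)
```

Every sample passes the naive test, but only "he said" and "said he" actually contain the word 'he.'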

### Refining our filter using Regular Expressions

The solution is to use "regular expressions," which are ways of coding the details of language.  We will use the **re** package and the command **.compile()** to make a "regular expression" that helps Python to search for an *exact word* like 'he' rather than the string 'he' as it appears in words like 'her' and 'heart.'


In [80]:
import re

**.compile()** takes one object -- a properly formatted "Regular Expression."

The **re.compile()** command allows you to communicate with Python about such needs as detecting the beginning or end of a word.

In fact, telling Python about the beginning and end of words is the major reason you'll want to use Regular Expressions in this class. We'll search for words this way all the time. 

In [81]:
pattern = re.compile("\\bhe\\b") #  notice the .compile() and the "escapes"+b to signify "word boundary"

Here are the rules you need to know:

   * Special codes in Regular Expressions begin with a backslash, which tells the computer not to take the next letter literally. This is called an "escape." Inside an ordinary Python string the backslash itself has to be escaped, which is why the code above writes two of them: **"\\\\"**.
   * With "b" for "boundary" -- **"\\\b"** -- the regular expression tells the computer to search for a word "boundary."
   * If you place "\\\b" before and after a word, then send it to re.compile(), that tells Python to search for the word only where it stands as a freestanding word in a string of text, e.g. " he " or "he.", but not "her," "heart", or "she." The search is case-sensitive, so the pattern finds "he" but not "He"; that is fine here because we lowercased the novels.

A "boundary" is any point where a word character (a letter, digit, or underscore) sits next to something that is not one: a space, a punctuation mark, or the beginning or end of the string.
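Here is a small sketch of these rules in action. It also shows Python's "raw string" notation (an r before the opening quote), in which each backslash is written only once; this is a common alternative spelling, not something the notebook's own code uses. The test sentence is invented, and **.findall()** is a standard command in the re package that returns every place the pattern occurs.

```python
import re

escaped = re.compile("\\bhe\\b")   # ordinary string: each backslash is doubled
raw = re.compile(r"\bhe\b")        # raw string: each backslash is written once

same_pattern = (escaped.pattern == raw.pattern)   # both spell the same expression

# Invented test sentence with 'he' inside several longer words.
sentence = "he said that she thought he. the heart"
found = raw.findall(sentence)

print(same_pattern)
print(found)
```

Only the two freestanding occurrences are found: the one at the start of the sentence and the one followed by a period. The 'he' inside "she," "the," and "heart" is ignored.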

#### Evaluating a regular expression using .match()

Notice how the pattern uses two "\\\b"'s to tell the computer to look for the word "he" but not "her" or "the." 

Notice the use of the command **.match()**.  

   * .match attaches to 'pattern' -- which is the variable we defined up above as the regular expression for freestanding 'he'. 
       * 'pattern' here is just a variable name; it could be called anything.
   * .match takes one object: the string in which Python looks for the pattern. 
       * In the loop further below, Python tests each "bigram" in "bigramlist," one string at a time.
   * .match only looks at the *beginning* of the string. It finds 'he' in "he said" but not in "said he." That suits us, because we want the verbs that *follow* 'he.'

The output of .match() is a "match object" if the pattern is found and None if it is not.  Python treats a match object as true and None as false, so we can use that output with conditionals (such as 'if') to tell Python how to behave.
   

Let's test whether the compiled 'pattern' matches the string 'she':

In [82]:
if pattern.match("she"):
    print("true")
else:
    print("false")

false


Let's test whether the compiled 'pattern' matches the string 'he':

In [83]:
if pattern.match("he"):
    print("true")
else:
    print("false")

true


Let's test whether the compiled 'pattern' matches the string 'he said':

In [86]:
if pattern.match("he said"):
    print("true")
else:
    print("false")

true
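One more test is worth running, because it shows a limit of .match(): it only checks the beginning of the string. The sketch below compares it with **.search()**, another standard command in the re package, which scans the whole string. The notebook itself only uses .match(); .search() appears here purely for contrast.

```python
import re

pattern = re.compile("\\bhe\\b")   # same pattern as above

starts_with_he = pattern.match("he said") is not None    # 'he' is at the start
match_said_he = pattern.match("said he") is not None     # 'he' is not at the start
search_said_he = pattern.search("said he") is not None   # .search() looks everywhere

print(starts_with_he, match_said_he, search_said_he)
```

For our purposes the behavior of .match() is helpful: a bigram passes the test only when 'he' is its first word, so the second word is whatever follows 'he' in the text.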


### Search for a regular expression in a bigram list

Let's write a loop to search for the pattern 'he' with word boundaries coded by regular expression.

Recall that we defined the variable *pattern* above using re.compile() and the symbol **\\\b** for word boundaries.

In [57]:
he_bigrams = [] # an empty list for saving certain data only

for bigram in bigramlist: # cycle through each bigram in the list
    if pattern.match(bigram): # does this bigram begin with a freestanding 'he'?
        he_bigrams.append(bigram) # if so, save it.  
        
pd.Series(he_bigrams).value_counts()[:10]

he was        126
he had        113
he is          75
he has         49
he did         37
he could       36
he would       28
he said        23
he should      23
he replied     22
dtype: int64

Because our filter is now refined enough to identify unique words, we can try it on the stopworded *anycleanbigramlist* data we made above. We use the "any" list rather than the "all" list because 'he' is itself a stopword: the "all" list contains no bigrams with 'he' in them.

In [60]:
he_bigrams = []

for bigram in anycleanbigramlist:
    if pattern.match(bigram): # notice the use of .match()
        he_bigrams.append(bigram)
        
pd.Series(he_bigrams).value_counts()[:20]

he could        36
he would        28
he said         23
he replied      22
he might        19
he must         18
he came         12
he looked       11
he never         9
he really        8
he may           8
he left          8
he stopped       8
he added         7
he married       7
he went          6
he told          6
he felt          6
he seemed        6
he continued     6
dtype: int64

This list gives us many verbs that appear with the masculine pronoun in Jane Austen's novels.  

Notably, this list doesn't look much like the list of verbs for men from the study of American movies that we saw above.  Men aren't "beating" or "instructing" or "challenging."  What are the men doing in Austen?

   * They have conversations: prominent phrases include: "he said," "he replied," "he told," "he added," and "he stopped".  
       * The last of these phrases, "he stopped," is ambiguous. We'd have to look at in-text mentions to tell whether men are "stopping by" to say hello, or stopping in the  midst of conversation.
   * They are able to do things -- perhaps in distinction to what women can do, although we'd have to test this hypothesis by studying in-text mentions: "he could," "he would," "he might," "he may"
   * They are involved in relationships, which have an emotional element: "he married," "he felt" 
   * Perhaps, they have obligations: "he must" suggests that, although the phrase is ambiguous. We'd have to look up in-text mentions to figure out if people are telling others that they "must" do something, or if men feel that they "must" fulfill their duty, or if observers anticipate that men are likely to do something in the future.
   * Other people speculate about men: "he seemed."  Again, we'd want to look this up for more detail.
   
What happens if you expand the number of phrases we're looking at from 20 to 40?
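To repeat the search for a different pronoun, the pattern and the loop can be wrapped into one reusable function. This is a sketch under stated assumptions: the function name `bigrams_starting_with` is made up, the toy bigram list is invented, and **re.escape()** (a standard command in the re package) is added so that any word can be passed in safely.

```python
import re
from collections import Counter

def bigrams_starting_with(word, bigrams):
    """Keep only the bigrams whose first word is exactly `word`."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    return [b for b in bigrams if pattern.match(b)]

# Invented toy data standing in for the notebook's bigram lists.
toy_bigrams = ["he said", "she said", "she was", "said he", "the house", "she said"]

she_counts = Counter(bigrams_starting_with("she", toy_bigrams)).most_common()
he_list = bigrams_starting_with("he", toy_bigrams)

print(she_counts)
print(he_list)
```

Running the function once with "he" and once with "she" on the real data would give two lists to set side by side, in the spirit of the movie-plot study mentioned at the top of this notebook.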

### Assignment


1) Write a piece of code that looks for bigrams, trigrams, up to nine-grams in Jane Austen.  You will do this most easily by creating a for-loop where each iteration of the loop changes the '_' in the 'n = _' parameters of the .ngrams() command from 2 to 3 to 4 to 5 etc.
    
  * Add a line using *len()* to the for-loop to find out how long is the list of n-grams generated.  Count how many total multi-word phrases there are for each iteration, i.e.: how many 9-word phrases are there, total? how many 8-word phrases? how many 7-word phrases? etc. 
    
  * Adjust the for-loop to save the results of each count in a list.  HINT: You may need to create a new dummy variable for this count before the for loop.
    
  * Use value_counts() to count how many times each multi-word phrase appears.  What are the longest phrases that are repeated more than 3 times across Austen's corpus? Paste at least 5 into a table in Word.  
  
  * Write a paragraph of at least three sentences interpreting Austen's most frequent long phrases.
    
2) Use stopwording.  What are the most frequent 3-word phrases in Jane Austen that don't include stopwords? HINT: inside the for loop, create a section to split up your trigrams and test their words individually.

  * Create a table.  Output the results to Word.  Write a paragraph of at least three sentences discussing the phrases.  
      
 

Upload a screenshot of your code and the answer to Canvas.