For citation information, please see the "Source Information" section listed in the associated README file: https://github.com/stephbuon/digital-history/tree/master/hist3368-week3-ngrams-lemmatization-gender

# Week 3 Mini Notebook: Searching Bigrams

Very often in this work we will be interested not only in individual words but in multiple word phrases.  Sometimes multiple word phrases are repeated often, and they hold potent meaning:
    
   * true love
   * dining room
   * law and order

Counting two-word phrases can also illuminate hidden bias in large groups of documents. For example, one study found that plot summaries of twentieth-century American movies tended to assign different verbs to the pronouns "he" and "she":

   * he beats
   * he instructs
   * he challenges
   * he orders
   * he owns
   
 
   * she cries
   * she rejects
   * she accepts
   * she forgives
   * she begs
   
see http://varianceexplained.org/r/tidytext-gender-plots/

Clearly, these two-word phrases suggest some bias in the ways that women and men are portrayed on the American screen.

The practice of finding multiple-word phrases is called finding **"ngrams"**.  A two-word phrase is also called a **"bigram"**.  A three-word phrase is called a **"trigram"**, and so on.  
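
As a rough sketch of the idea, independent of any software package, n-grams can be built in plain Python by sliding a window of n words across a list of words. The helper name `make_ngrams` and the sample sentence are assumptions made for illustration; they are not part of TextBlob.

```python
def make_ngrams(words, n):
    """Return every run of n consecutive words as a tuple (hypothetical helper)."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

sample = "the family of dashwood had long been settled in sussex".split()

bigrams = make_ngrams(sample, 2)   # two-word windows
trigrams = make_ngrams(sample, 3)  # three-word windows

print(bigrams[:3])
print(trigrams[:2])
```

A sentence of ten words yields nine bigrams and eight trigrams, because each window starts one word later than the last.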

In general, repeated many-word phrases often hold telling clues about the values of a text.

In this notebook, we will learn how to search for multiple-word phrases.   We will use the **TextBlob** software package and the command **.ngrams()** to find these phrases.

## Download some Jane Austen Novels

These steps repeat the instructions for downloading and cleaning text from earlier. 

In [189]:
import nltk, numpy, re, matplotlib# , num2words

In [190]:
cd /scratch/group/history/hist_3368-jguldi

/scratch/group/history/hist_3368-jguldi


In [191]:
#download some data

with open('senseandsensibility.txt', 'r') as myfile:
    sas_data = myfile.read().split('\n\n"I suppose you know, ma\'am, that Mr. Ferrars is married"\n\nIt _was_ Edward\n\n"Everything in such respectable condition"\n\n ')[1].split('THE END')[0].strip()

with open('emma.txt', 'r') as myfile:
    emma_data = myfile.read().split('CHAPTER I')[1].split('FINIS')[0].strip()

with open('prideandprejudice.txt', 'r') as myfile:
    pap_data = myfile.read().split('CHAPTER I')[1].split('End of the Project Gutenberg EBook of Pride and Prejudice, by Jane Austen')[0].strip()

# combine into a list
data = [sas_data, emma_data, pap_data]

# remove whitespace characters
for i in range(len(data)):
    data[i] = data[i].replace('\n', ' ')

# split into words
import pandas as pd # the cells below refer to pandas by its usual short name, pd

for novel in data:
    words = novel.split()

# lowercase and strip punctuation
import re

for i in range(len(data)):
    # data[i] is the current novel
    data[i] = data[i].lower() # force to lowercase
    data[i] = re.sub(r'[",.;:?([)\]_*]', '', data[i]) # remove punctuation and special characters with a regular expression (raw string, so the backslash reaches re unchanged)



We now have a list of novels called **data**, which has been lowercased and stripped of punctuation.
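
To see what the two cleaning steps do, here is a small sketch applied to one made-up sentence rather than to a whole novel. The sample string is an assumption for illustration. The character class is the same one used in the cell above, written as a raw string.

```python
import re

# A made-up sentence containing the kinds of punctuation found in the novels
sample = 'It _was_ Edward. "Yes," she said (quietly).'

cleaned = sample.lower()                           # force to lowercase
cleaned = re.sub(r'[",.;:?([)\]_*]', '', cleaned)  # delete the listed punctuation marks

print(cleaned)
```

Note that apostrophes and hyphens are not in the character class, so they survive the clean-up.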

***IMPORTANTLY*** please note that we did not stopword the data this time.  When we look for grammatical relationships -- like verbs that follow nouns, which we will capture in today's analysis, or meaningful two-word phrases -- stopwords are important.  Stopwords are part of the order of the sentence.  If we stopworded the paragraph below, we would produce a list of two-word phrases that are not accurate for the flow of Austen's language.   

In [193]:
data[0][:500].strip() # .strip() just removes the extra whitespace from the title page and allows the text to display properly

'chapter i   the family of dashwood had long been settled in sussex their estate was large and their residence was at norland park in the centre of their property where for many generations they had lived in so respectable a manner as to engage the general good opinion of their surrounding acquaintance the late owner of this estate was a single man who lived to a very advanced age and who for many years of his life had a constant companion and housekeeper in his s'

If we removed stopwords first and then counted n-grams, we wouldn't get accurate results, because words that were never neighbors in Austen's sentences would end up paired together. Instead, we'll find the n-grams and then remove n-grams with stopwords in them.
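
The difference between the two orders can be sketched on a single phrase from the passage above. The tiny stopword set and the helper name `pairs` are assumptions for this illustration; NLTK's real English stopword list is much longer.

```python
# A tiny stand-in stopword list (an assumption; not NLTK's list)
tiny_stopwords = {"the", "of", "this", "was", "a"}

words = "the late owner of this estate was a single man".split()

def pairs(tokens):
    """Join each word to its right-hand neighbor (hypothetical helper)."""
    return [a + " " + b for a, b in zip(tokens, tokens[1:])]

# Wrong order: remove stopwords first, then pair up whatever is left
stopworded_first = pairs([w for w in words if w not in tiny_stopwords])

# Order used in this notebook: make the bigrams first, then filter them
bigrams_first = [b for b in pairs(words)
                 if any(w not in tiny_stopwords for w in b.split())]

print(stopworded_first)
print(bigrams_first)
```

The first list contains "owner estate" and "estate single," phrases that Austen never wrote. The second list contains only pairs of words that really sit next to each other in the text.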

As this course moves forward, we'll discuss less and less the steps for cleaning that we have chosen and the order.  We'll leave it to you to figure out why certain steps go in a certain order. 

You should get into the habit of asking yourself which cleaning steps have been executed, in which order, and what that means for the results. 

In text mining, the analyst has to choose which cleaning steps to use for each analysis in question.  


# Counting Words and N-Grams

### N-Grams

Sometimes we want to look for multi-word phrases instead of individual words.  For example, if we're researching the living spaces of Jane Austen's England, we definitely want to know whether she refers to "dining rooms" or "bed-rooms" (which our punctuation clean-up might have turned into separate words, depending on what we did).

We will use the software package **textblob** and the command TextBlob().  **TextBlob()** takes an object which is a string of text.

   TextBlob(novel)

We will use the command **.ngrams()**, which takes the argument "n = _," where _ is set as the number of words in a phrase.

Here's how we tell TextBlob to look at the first 100 characters of Sense and Sensibility for 2-word phrases.

In [195]:
from textblob import TextBlob

bigrams = TextBlob(data[0][:100]).ngrams(n=2)

Notice that right now the bigrams are in TextBlob's own **"WordList"** format.  WordList is a special kind of list defined by the TextBlob package. It functions much like a **normal list**.

We can convert them into a **normal list** by:
   * calling each part of the list using square brackets (**bigram[0]**, **bigram[1]**)
   * pasting them together using '+' with a space (' ') in the middle 

Here's a loop to clean them up:

In [196]:
bigramlist = []
for bigram in bigrams:
    bigram2 = bigram[0] + ' ' + bigram[1]
    bigramlist.append(bigram2)
bigramlist[:20]

['chapter i',
 'i the',
 'the family',
 'family of',
 'of dashwood',
 'dashwood had',
 'had long',
 'long been',
 'been settled',
 'settled in',
 'in sussex']

Notice what's happening. If you read through the first word only of the above pairs, it reads just like the first words of the chapter: "Chapter i: The Family of Dashwood had long been settled in Sussex."  But the computer has identified all the two-word pairs in that sentence: "the family," "family of," "of Dashwood."  

That is *exactly* what bigrams are.
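
The loop above glues two words together with '+'. As a hedged alternative, Python's built-in string method `.join()` does the same job for phrases of any length. In this sketch, plain Python lists stand in for TextBlob's WordList objects, and the variable names are assumptions for illustration.

```python
# Plain lists stand in for TextBlob's WordList objects in this sketch
sample_bigrams = [["chapter", "i"], ["i", "the"], ["the", "family"]]
sample_trigrams = [["chapter", "i", "the"], ["i", "the", "family"]]

# " ".join(...) pastes the words together with a space between each one
bigram_strings = [" ".join(gram) for gram in sample_bigrams]
trigram_strings = [" ".join(gram) for gram in sample_trigrams]

print(bigram_strings)
print(trigram_strings)
```

Because `.join()` does not care how many words it receives, the same line works unchanged for trigrams and longer phrases.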

Now, let's try the cleaning step we just tried, applied to all the novels, using a loop.

In [197]:
bigramlist = []

for novel in data:
    bigrams = TextBlob(novel).ngrams(n=2)
    for bigram in bigrams:
        bigram2 = bigram[0] + ' ' + bigram[1]
        bigramlist.append(bigram2)

bigramlist[:10]

['chapter i',
 'i the',
 'the family',
 'family of',
 'of dashwood',
 'dashwood had',
 'had long',
 'long been',
 'been settled',
 'settled in']

Now that we have all the bigrams in one list, we can also count the overall top bigrams.

Let's count the most common bigrams with **value_counts()**:

In [198]:
bigramcounts = pd.Series(bigramlist).value_counts() # turn the list into a pandas Series, then count each distinct bigram
bigramcounts[:100]

to be            448
of the           444
in the           369
of her           291
it was           290
                ... 
miss dashwood     70
is a              70
out of            69
you have          69
to him            69
Length: 100, dtype: int64
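
For comparison, the same kind of tally can be sketched with `Counter` from Python's standard library, which needs no pandas. The short list of bigrams here is made up for illustration and is not drawn from the novels.

```python
from collections import Counter

# A made-up handful of bigrams (an assumption for this sketch)
sample_bigrams = ["to be", "of the", "to be", "it was", "of the", "to be"]

counts = Counter(sample_bigrams)   # maps each distinct bigram to how often it appears
top_two = counts.most_common(2)    # the two most frequent bigrams, highest count first

print(top_two)
```

This notebook keeps using **value_counts()** because pandas prints a tidy table, but `Counter` gives the same counts.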

We did it!  Also, our results are boring.  Now is a good time to apply stopwords.  


## Stopwording a bigram list


Why now? Because we have already done the n-gram analysis, and all of our n-grams are grammatically meaningful.

*As a general rule, you must apply grammatical analysis -- like counting n-grams -- before removing words*

#### How to stopword a bigram list

Stopwording a bigram list is a little more tricky than stopwording a single word list, because each bigram is composed of two words.

In [199]:
from nltk.corpus import stopwords
stopwords = stopwords.words('english')

"to" in stopwords

True

In [200]:
"be" in stopwords

True

In [201]:
"to be" in stopwords

False

However, if we break up "to be" into two words, stopwording works again.

In [202]:
"to be".split()

['to', 'be']

In [204]:
for word in  "to be".split():
    print(word in stopwords)

True
True


Another way of condensing that information is to rewrite the for loop above as a single expression of the form

     item in list1 for item in list2 

-- which runs exactly the same test as the loop above, once for each item.

Then we can use the built-in function "all" to see if every test is true (both 'to' and 'be' are in stopwords) -- or the built-in function "any" to see whether at least one test is true.


In [207]:
all(word in stopwords for word in "to be".split())

True

In [208]:
any(word in stopwords for word in "to be".split())

True
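
Both tests return True for "to be" because both of its words are stopwords. A short sketch with other phrases shows where "all" and "any" part ways. The tiny stopword set and the phrases are assumptions for illustration; NLTK's list is much longer.

```python
# A tiny stand-in stopword list (an assumption; not NLTK's list)
tiny_stopwords = {"to", "be", "the", "of"}

results = {}
for phrase in ["to be", "the family", "norland park"]:
    words = phrase.split()
    every_word = all(w in tiny_stopwords for w in words)  # True only if both words are stopwords
    some_word = any(w in tiny_stopwords for w in words)   # True if at least one word is a stopword
    results[phrase] = (every_word, some_word)
    print(phrase, every_word, some_word)
```

"the family" is the interesting case: "any" is True because of "the," but "all" is False because "family" is not a stopword.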

We can write a for loop to apply stopwords to the bigram list.

In [234]:
cleanbigramlist = []
    
for bigram in bigramlist:
    bigram2 = bigram.split()
    if any(word not in stopwords for word in bigram2): # keep the bigram if at least one of its words is not a stopword
        bigram3 = bigram2[0] + ' ' + bigram2[1] # glue the split words back into a phrase with a space in the middle
        cleanbigramlist.append(bigram3) # add the phrase to the clean list

cleanbigramlist[:20]

['chapter i',
 'the family',
 'family of',
 'of dashwood',
 'dashwood had',
 'had long',
 'long been',
 'been settled',
 'settled in',
 'in sussex',
 'sussex their',
 'their estate',
 'estate was',
 'was large',
 'large and',
 'their residence',
 'residence was',
 'at norland',
 'norland park',
 'park in']

What are the top stopworded bigrams across Jane Austen?

In [235]:
pd.Series(cleanbigramlist).value_counts()[:20]

mrs jennings       237
could not          167
her sister         145
colonel brandon    132
mrs dashwood       121
her mother         118
sir john           112
she could          108
would not          101
to see             101
would be           100
lady middleton      96
the world           92
the house           88
would have          86
must be             86
every thing         81
could be            77
it would            77
am sure             77
dtype: int64

Question: how would the results of this table look different if you used "all" instead of "any" in the loop above?  Try swapping them out and see.

### Filtering bigrams for a common word

Often we will want to look not at the entire set of bigrams but at bigrams that contain words that are meaningful for a certain kind of analysis, for instance verbs that follow 'she' and verbs that follow 'he.'

What if we only want the bigrams that include the word "she"?

Here is a naive way to do it.

In [229]:
she_bigrams = []

for bigram in bigramlist:
    if "she" in bigram: # naive test: true whenever the letters "she" appear anywhere in the bigram
        she_bigrams.append(bigram)
        
pd.Series(she_bigrams).value_counts()[:10]

she was      213
she had      193
that she     122
as she       116
she could    108
and she       96
she would     64
said she      47
she is        46
which she     45
dtype: int64

Unfortunately, the above code won't work for 'he,' because it will pick up other words that contain 'he,' including:
   * 'she' 
   * 'mother' 
   * 'heart' 
   * 'her'

In [230]:
he_bigrams = []

for bigram in bigramlist:
    if "he" in bigram: # naive test: also true for "she," "her," "the," and any other word containing "he"
        he_bigrams.append(bigram)
        
pd.Series(he_bigrams).value_counts()[:30]

of the        444
in the        369
of her        291
to the        244
to her        234
she was       213
she had       193
on the        161
and the       160
at the        160
her sister    145
for the       138
her own       131
he was        126
that she      122
in her        120
her mother    118
by the        117
as she        116
that he       116
he had        113
they were     111
from the      109
she could     108
all the       104
the same      103
of their       97
and she        96
and her        96
with the       92
dtype: int64

### Refining our filter using Regular Expressions

The solution is to use "regular expressions," which are ways of coding the details of language.  We will use the **re** package and the command **.compile()** to make a "regular expression" that helps Python to search for an *exact word* like 'he' rather than the string 'he' as it appears in words like 'her' and 'heart.'


In [238]:
import re

**.compile()** takes one object -- a properly formatted "Regular Expression."

The **re.compile()** command allows you to communicate with Python about such needs as detecting the beginning or end of a word.

In fact, telling Python about the beginning and end of words is the major reason you'll want to use Regular Expressions in this class. We'll search for words this way all the time. 

In [None]:
pattern = re.compile("\\bhe\\b") #  notice the .compile() and the "escapes"+b to signify "word boundary"

Here are the rules you need to know:

   * codes in Regular Expressions begin with a backslash. This punctuation tells the computer not to take the next letter literally.  This is called an "escape." Inside an ordinary Python string the backslash must itself be escaped, so we type two of them: **"\\\\"**.
   * with "b" for "boundary," --  **"\\\b"** -- the regular expression tells the computer to search for a word "boundary."
   * If you place "\\\b" before and after a word, then send it to re.compile(), that tells Python to search for the word wherever it exists as a freestanding word in a string of text, e.g. " he " or "he." but not "her," "heart", or "she." The search is case-sensitive, so "He" would not be found; that is one reason we lowercased the text first.

If you tell the computer to find a "boundary" in this way, it will look for any place where a word character (a letter, digit, or underscore) meets something that is not one: a space, a punctuation mark, or the beginning or end of the string.
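
Here is a minimal sketch of the word-boundary rule at work on a few made-up strings (the list `tests` is an assumption for illustration). It writes the pattern as a raw string, r"\bhe\b", which is an equivalent way of typing "\\bhe\\b", and it uses **.search()**, which scans the whole string.

```python
import re

# r"..." is a raw string: Python leaves the backslashes alone, so one backslash is enough
pattern = re.compile(r"\bhe\b")
same_pattern = re.compile("\\bhe\\b")  # the doubled-backslash spelling of the same pattern

tests = ["he was", "that he", "her mother", "she had", "the heart", "he."]

# keep only the strings that contain "he" as a freestanding word
hits = [t for t in tests if pattern.search(t)]

print(hits)
```

"her mother," "she had," and "the heart" are rejected because the letters "he" always touch another letter on at least one side.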

#### Search for a regular expression in the bigram list

Notice how the pattern above uses two "\\\b"'s to tell the computer to look for the word "he" but not "her" or "the." 

Notice the use of the command **.match()**.  

   * .match attaches to 'pattern' -- which is the variable we defined up above as the regular expression for freestanding 'he'. 
       * 'Pattern' here is just a variable name; it could be called anything.
   * .match takes one object: the string where Python is searching for the pattern. 
       * In the case below, Python is searching for the pattern "he" (as a freestanding word) in each "bigram" in "bigramlist."
   * .match only looks at the *beginning* of the string, so the loop below keeps bigrams whose first word is "he" (such as "he was") and skips bigrams like "that he."  To find the pattern anywhere in the string, use **.search()** instead.
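
The difference between **.match()** and **.search()** can be sketched on a handful of made-up bigrams. The list `sample_bigrams` is an assumption for illustration and is not drawn from the novels.

```python
import re

pattern = re.compile(r"\bhe\b")  # freestanding "he", written as a raw string

sample_bigrams = ["he was", "that he", "her sister", "said he"]

# .match() tests only the start of each string
starts_with_he = [b for b in sample_bigrams if pattern.match(b)]

# .search() tests every position in each string
contains_he = [b for b in sample_bigrams if pattern.search(b)]

print(starts_with_he)
print(contains_he)
```

Which one to choose depends on the question. To study the verbs that follow "he," keeping only bigrams that begin with "he" is what we want, so .match() is a reasonable choice here.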
   

In [239]:
he_bigrams = []

for bigram in bigramlist:
    if pattern.match(bigram): # notice the use of .match()
        he_bigrams.append(bigram)
        
pd.Series(he_bigrams).value_counts()[:10]

he was        126
he had        113
he is          75
he has         49
he did         37
he could       36
he would       28
he said        23
he should      23
he replied     22
dtype: int64

Because our filter is now refined enough to pick out whole words, we can try it on the stopworded *cleanbigramlist* data we made above.

In [237]:
he_bigrams = []

for bigram in cleanbigramlist:
    if pattern.match(bigram): # notice the use of .match()
        he_bigrams.append(bigram)
        
pd.Series(he_bigrams).value_counts()[:20]

he could        36
he would        28
he said         23
he replied      22
he might        19
he must         18
he came         12
he looked       11
he never         9
he really        8
he left          8
he may           8
he stopped       8
he added         7
he married       7
he seemed        6
he continued     6
he felt          6
he told          6
he went          6
dtype: int64

### Assignment

1) Use regular expressions to search for "she" as a freestanding word in the stopworded list *cleanbigramlist*.  
  * Paste the he list and she list side by side in a Word Document. 
    
  * Write a paragraph of at least three sentences comparing the two lists
    
    
2) Write a for-loop that looks for bigrams, trigrams, up to nine-grams in Jane Austen, by switching out '_' in the 'n = _' parameter of the .ngrams() command.
    
  * Add a line of .value_counts() to the for-loop.  Count how many multi-word phrases there are for each iteration, i.e.: how many 9-word phrases are there, total? how many 8-word phrases? how many 7-word phrases? etc. 
    
  * Adjust the for-loop to save the results of each count in a list.  HINT: You may need to create a new dummy variable for this count before the for loop.
    
  * What are the longest phrases that are repeated more than 3 times across Austen's corpus? Paste at least 5 into a table in Word.  
  
  * Write a paragraph of at least three sentences discussing the phrases.
    
3) Use stopwording.  What are the longest 3-word phrases in Jane Austen that don't include stopwords? HINT: inside the for loop, create a section to split up your n-grams and test their words individually.

  * Create a table.  Output the results to Word.  Write a paragraph of at least three sentences discussing the phrases.  
      
 

Upload a screenshot of your code and the answer to Canvas.