For citation information, please see the "Source Information" section listed in the associated README file: https://github.com/stephbuon/digital-history/tree/master/hist3368-week3-ngrams-lemmatization-gender

# Week 3 Mini Notebook: Searching Bigrams

Very often in this work we will be interested not only in individual words but in multi-word phrases.  Sometimes multi-word phrases are repeated often, and they hold potent meaning:
    
   * true love
   * dining room
   * bitter justice

Counting two-word phrases can also illuminate hidden bias in large groups of documents. For example, one study found that plot summaries of twentieth-century American movies tended to assign different verbs to the pronouns "he" and "she":

   * he beats
   * he instructs
   * he challenges
   * he orders
   * he owns
   * she cries
   * she rejects
   * she accepts
   * she forgives
   * she begs
   
see http://varianceexplained.org/r/tidytext-gender-plots/

Clearly, these two-word phrases suggest some bias in the ways that women and men are portrayed on the American screen.

In this notebook, we will learn how to search for multiple-word phrases.   We will use the **TextBlob** software package and the command **.ngrams()** to find these phrases.

## Download some Jane Austen Novels

These steps repeat the instructions for downloading and cleaning text from earlier. 

In [2]:
import nltk, numpy, re, matplotlib# , num2words

In [3]:
cd /scratch/group/history/hist_3368-jguldi

/scratch/group/history/hist_3368-jguldi


In [103]:
# read the novels from local text files

with open('senseandsensibility.txt', 'r') as myfile:
    sas_data = myfile.read().split('\n\n"I suppose you know, ma\'am, that Mr. Ferrars is married"\n\nIt _was_ Edward\n\n"Everything in such respectable condition"\n\n ')[1].split('THE END')[0].strip()

with open('emma.txt', 'r') as myfile:
    emma_data = myfile.read().split('CHAPTER I')[1].split('FINIS')[0].strip()

with open('prideandprejudice.txt', 'r') as myfile:
    pap_data = myfile.read().split('CHAPTER I')[1].split('End of the Project Gutenberg EBook of Pride and Prejudice, by Jane Austen')[0].strip()

# combine into a list
data = [sas_data, emma_data, pap_data]

# replace newline characters with spaces
for i in range(len(data)):
    data[i] = data[i].replace('\n', ' ')

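# import pandas under its usual alias; pd is used later for counting
import pandas as pd

# split into words (demonstration only; `words` is not used later)
for novel in data:
    words = novel.split()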

# lowercase and strip punctuation
import re

for i in range(len(data)):
    # data[i] is the current novel
    data[i] = data[i].lower() # force to lowercase
    data[i] = re.sub('[\",.;:?([)\]_*]', '', data[i]) # remove punctuation and special characters with regular expression



We now have a list of novels called **data**, which has been lowercased and stripped of punctuation.

*IMPORTANTLY* please note that we did not stopword the data this time.  When we look for grammatical relationships -- like verbs that follow nouns, or meaningful two-word phrases -- many stopwords are important.  

The analyst has to choose which cleaning steps to use for each analysis in question.  
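To see why stopwording too early can mislead us, here is a minimal sketch. The names `toy_stopwords` and `bigrams` are made up for this illustration, and the stopword list is a tiny stand-in, not NLTK's list.

```python
# Hypothetical mini stopword list, for illustration only (not NLTK's list)
toy_stopwords = {"the", "of", "had", "in"}

words = "the family of dashwood had long been settled in sussex".split()

def bigrams(tokens):
    """Pair each word with the word that follows it."""
    return [a + " " + b for a, b in zip(tokens, tokens[1:])]

# Stopword first, then build bigrams: words that were never neighbors get glued together
filtered = [w for w in words if w not in toy_stopwords]
print(bigrams(filtered)[:2])   # ['family dashwood', 'dashwood long']

# Build bigrams first: every pair really occurs in the text
print(bigrams(words)[:3])      # ['the family', 'family of', 'of dashwood']
```

"family dashwood" never appears in the sentence, which is why this notebook builds the bigrams before it removes any stopwords.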

In [104]:
data[0][:500]

'                                 chapter i   the family of dashwood had long been settled in sussex their estate was large and their residence was at norland park in the centre of their property where for many generations they had lived in so respectable a manner as to engage the general good opinion of their surrounding acquaintance the late owner of this estate was a single man who lived to a very advanced age and who for many years of his life had a constant companion and housekeeper in his s'

# Counting Words and N-Grams

### N-Grams

Sometimes we want to look for multi-word phrases instead of individual words.  For example, if we're researching the living spaces of Jane Austen's England, we definitely want to know whether she refers to "dining rooms" or "bed-rooms" (which our punctuation clean-up might have turned into separate words, depending on what we did).

We will use the software package **textblob** and the command TextBlob().  **TextBlob()** takes an object which is a string of text.

   TextBlob(novel)

We will use the command **.ngrams()**, which takes the argument "n = _," where _ is set as the number of words in a phrase.
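To make the idea concrete: an n-gram is a window of n neighboring words, slid across the text one word at a time. Here is a minimal plain-Python sketch of that idea. `simple_ngrams` is a made-up helper name, and it splits only on whitespace, so its tokens may differ slightly from TextBlob's.

```python
def simple_ngrams(text, n=2):
    """Return every run of n consecutive words in text, as lists of words."""
    words = text.split()
    return [words[i:i + n] for i in range(len(words) - n + 1)]

sample = "the family of dashwood had long been settled in sussex"
print(simple_ngrams(sample, n=2)[:3])   # [['the', 'family'], ['family', 'of'], ['of', 'dashwood']]
print(simple_ngrams(sample, n=3)[:2])   # [['the', 'family', 'of'], ['family', 'of', 'dashwood']]
print(len(simple_ngrams(sample, n=2)))  # 9: a text of 10 words has 10 - 2 + 1 bigrams
```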

Here's how we tell TextBlob to look at the first 100 characters of Sense and Sensibility for 2-word phrases.

In [113]:
from textblob import TextBlob

bigrams = TextBlob(data[0][:100]).ngrams(n=2)
bigrams

[WordList(['chapter', 'i']),
 WordList(['i', 'the']),
 WordList(['the', 'family']),
 WordList(['family', 'of']),
 WordList(['of', 'dashwood']),
 WordList(['dashwood', 'had']),
 WordList(['had', 'long']),
 WordList(['long', 'been']),
 WordList(['been', 'settled']),
 WordList(['settled', 'in']),
 WordList(['in', 'sussex'])]

Right now the bigrams are in TextBlob's own "WordList" format.  

We can convert them into a normal list by:
   * calling each part of the list using square brackets (**bigram[0]**, **bigram[1]**)
   * pasting them together using '+' with a space (' ') in the middle 

Here's a loop to clean them up:

In [114]:
bigramlist = []
for bigram in bigrams:
    bigram2 = bigram[0] + ' ' + bigram[1]
    bigramlist.append(bigram2)
bigramlist[:20]

['chapter i',
 'i the',
 'the family',
 'family of',
 'of dashwood',
 'dashwood had',
 'had long',
 'long been',
 'been settled',
 'settled in',
 'in sussex']
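The same pasting step can also be sketched with `' '.join()`, which has the advantage of working for phrases of any length, not just two words. The lists below are small stand-ins for TextBlob's output, and `toy_bigrams` and `joined` are names made up for this example. It should behave the same way on a real WordList, assuming a WordList acts like an ordinary list of strings.

```python
# Stand-in for TextBlob's output: each bigram is a list of two words
toy_bigrams = [["chapter", "i"], ["i", "the"], ["the", "family"]]

# ' '.join() pastes the words of each phrase together with a space between them
joined = [" ".join(bigram) for bigram in toy_bigrams]
print(joined)  # ['chapter i', 'i the', 'the family']

# The same line works unchanged for a three-word phrase
print(" ".join(["the", "family", "of"]))  # the family of
```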

Now, let's try that on all the novels, using a loop.

In [115]:
bigramlist = []

for novel in data:
    bigrams = TextBlob(novel).ngrams(n=2)
    for bigram in bigrams:
        bigram2 = bigram[0] + ' ' + bigram[1]
        bigramlist.append(bigram2)

bigramlist[:10]

['chapter i',
 'i the',
 'the family',
 'family of',
 'of dashwood',
 'dashwood had',
 'had long',
 'long been',
 'been settled',
 'settled in']

Now that we have all the bigrams in one list, we can also count the overall top bigrams.
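What "counting" means here can be sketched with `Counter` from Python's standard library, which tallies how many times each item appears in a list. This is only an illustration on a made-up list called `toy_bigramlist`; the notebook itself uses pandas.

```python
from collections import Counter

# Small stand-in list; in the notebook this would be the full bigramlist
toy_bigramlist = ["to be", "of the", "to be", "in the", "of the", "to be"]

counts = Counter(toy_bigramlist)
print(counts.most_common(2))  # [('to be', 3), ('of the', 2)]
```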

Let's count the most common bigrams with **value_counts()**:

In [135]:
bigramcounts = pd.Series(bigramlist).value_counts() # turn the list into a Series, then count each bigram
bigramcounts[:100]

to be            448
of the           444
in the           369
of her           291
it was           290
                ... 
miss dashwood     70
is a              70
out of            69
you have          69
to him            69
Length: 100, dtype: int64

## Stopwording a bigram list

These top bigrams are made up almost entirely of stopwords, which is boring.  Now is a good time to apply stopwords.  

Why now? Because we have already done the n-gram analysis, and all of our n-grams are grammatically meaningful.

We can write a for loop to apply stopwords to the bigram list.

In [151]:
from nltk.corpus import stopwords
stopwords = stopwords.words('english')

cleanbigramcounts = []
    
for n in range(len(bigramcounts)):
    bigram2 = bigramcounts.index[n].split()
    # careful: (bigram2[1] and bigram2[0]) evaluates to bigram2[0], so only the first word is tested
    if (bigram2[1] and bigram2[0]) not in stopwords:
        cleanbigramcounts.append(bigramcounts.iloc[[n]]) # use double brackets to grab the whole row -- including the index

cleanbigramcounts[:20]

[mrs jennings    237
 dtype: int64, could not    167
 dtype: int64, colonel brandon    132
 dtype: int64, mrs dashwood    121
 dtype: int64, sir john    112
 dtype: int64, would not    101
 dtype: int64, would be    100
 dtype: int64, lady middleton    96
 dtype: int64, would have    86
 dtype: int64, must be    86
 dtype: int64, every thing    81
 dtype: int64, could be    77
 dtype: int64, mrs ferrars    76
 dtype: int64, marianne 's    74
 dtype: int64, miss dashwood    70
 dtype: int64, may be    62
 dtype: int64, elinor 's    62
 dtype: int64, said elinor    61
 dtype: int64, soon as    61
 dtype: int64, sister 's    54
 dtype: int64]

### Data wrangling diversion: splitting bigrams to check if they contain stopwords.

If you examine the for-loop above, you might notice some tricky calls. That's because bigramcounts is a pandas Series object -- with an index (the bigram) and a value (the count).  
   * We have to call each bigram count index to get the bigram itself rather than the count.

In [141]:
bigramcounts.iloc[[2]]

in the    369
dtype: int64

In [142]:
bigramcounts.index[2]

'in the'

We have to split the bigram in two to get the individual words.

In [143]:
bigramcounts.index[2].split()

['in', 'the']

Then we can use the 'in' operator to check if one of those words is in the stopwords.

In [144]:
'in' in stopwords

True

The line below does the same thing in one step -- it takes the bigram at position 2 in the index of bigramcounts, picks out its second word, and checks whether that word is in the stopwords.

In [145]:
bigramcounts.index[2].split()[1] in stopwords

True

Here's a tiny demo for-loop that cycles through each word of a two-word phrase and asks if it is in stopwords.

In [146]:
for word in (bigramcounts.index[2].split()):
    print(word in stopwords)

True
True


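The line below looks as if it uses the 'and' operator to check both words at once, but it does not.  In Python, (bigram2[1] and bigram2[0]) evaluates to bigram2[0] whenever bigram2[1] is a non-empty string, so only the first word of the bigram is actually tested against the stopwords.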

In [147]:
bigram2 = bigramcounts.index[1].split()
(bigram2[1] and bigram2[0]) in stopwords

True

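If we switch the 'and' to 'or' in the loop above, the expression evaluates to bigram2[1] instead, so the filter tests only the second word.  To drop every bigram that contains a stopword, test the two words separately: bigram2[0] not in stopwords and bigram2[1] not in stopwords.  Why don't you try that and see?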
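Python's 'and' between two strings hands back one of the strings rather than combining two tests, which is easy to miss. Here is a minimal sketch of the difference, using a tiny made-up list called `toy_stopwords` (not NLTK's list) and the built-in `any()` and `all()` functions to test each word separately.

```python
# Hypothetical mini stopword list, for illustration only (not NLTK's list)
toy_stopwords = ["in", "the", "of", "not"]

pair = "could not".split()   # ['could', 'not']

# 'and' between two non-empty strings simply returns the second one
print(pair[1] and pair[0])                      # could
print((pair[1] and pair[0]) in toy_stopwords)   # False: only 'could' was tested

# Testing each word separately
print(any(word in toy_stopwords for word in pair))  # True: at least one word is a stopword
print(all(word in toy_stopwords for word in pair))  # False: not every word is a stopword

# Keep a bigram only if neither of its words is a stopword
toy_bigrams = ["mrs jennings", "could not", "in the"]
kept = [b for b in toy_bigrams if not any(w in toy_stopwords for w in b.split())]
print(kept)   # ['mrs jennings']
```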

### Filtering bigrams for a common word

Often we will want to look not at the entire set of bigrams but at bigrams that contain words that are meaningful for a certain kind of analysis, for instance verbs that follow 'she' and verbs that follow 'he.'

What if we only want the bigrams that include the word "she"?

In [152]:
she_bigrams = []

for bigram in bigramlist:
    if "she" in bigram: # substring test: this also catches any bigram with "she" inside a longer word
        she_bigrams.append(bigram)
        
pd.Series(she_bigrams).value_counts()[:10]

she was      213
she had      193
that she     122
as she       116
she could    108
and she       96
she would     64
said she      47
she is        46
which she     45
dtype: int64

Unfortunately, the above code won't work for 'he,' because it will pick up other words that contain 'he,' including:
   * 'she' 
   * 'mother' 
   * 'heart' 
   * 'her'

In [153]:
he_bigrams = []

for bigram in bigramlist:
    if "he" in bigram: # substring test: "he" also appears inside "the", "she", and "her"
        he_bigrams.append(bigram)
        
pd.Series(he_bigrams).value_counts()[:30]

of the        444
in the        369
of her        291
to the        244
to her        234
she was       213
she had       193
on the        161
and the       160
at the        160
her sister    145
for the       138
her own       131
he was        126
that she      122
in her        120
her mother    118
by the        117
as she        116
that he       116
he had        113
they were     111
from the      109
she could     108
all the       104
the same      103
of their       97
and she        96
and her        96
with the       92
dtype: int64

### Refining our filter using Regular Expressions

The solution is to use "regular expressions," which are a compact way of describing patterns in text.  You can tell the computer to detect the beginning or end of a word by writing a backslash followed by "b" for "boundary."  Inside an ordinary Python string the backslash itself has to be "escaped" with a second backslash, which is why the pattern is typed with two.  A "boundary" matches the position between a word character and anything that is not one: a space, punctuation, or the start or end of the string.

Notice how I use two "\\b"'s below to tell the computer to look for the word "he" but not "her" or "the."  Python uses the 're' package to handle regular expressions, along with the .compile() and .match() commands.  Note that .match() only checks the beginning of each string, so the loop below keeps bigrams that start with "he" (like "he was") and skips bigrams where "he" comes second (like "that he").

In [154]:
import re
pattern = re.compile("\\bhe\\b") #  notice the .compile() and the "escapes"+b to signify "word boundary"
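Before we run the pattern on the whole bigram list, here is a small sketch of how it behaves on four sample phrases chosen for illustration. It compares .match(), which only looks at the start of a string, with .search(), which scans the whole string. The pattern is written as a raw string (the r before the quotes), which is just another way of typing the same pattern.

```python
import re

pattern = re.compile(r"\bhe\b")  # raw string: same pattern as "\\bhe\\b"

for phrase in ["he was", "that he", "the same", "her mother"]:
    at_start = bool(pattern.match(phrase))   # .match() only looks at the start of the string
    anywhere = bool(pattern.search(phrase))  # .search() scans the whole string
    print(phrase, at_start, anywhere)

# he was True True
# that he False True
# the same False False
# her mother False False
```

If we wanted to catch "that he" as well, swapping .match() for .search() would be one way to do it.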

In [155]:
he_bigrams = []

for bigram in bigramlist:
    if pattern.match(bigram): # .match() only tests the start of the bigram, so "that he" is skipped
        he_bigrams.append(bigram)
        
pd.Series(he_bigrams).value_counts()[:10]

he was        126
he had        113
he is          75
he has         49
he did         37
he could       36
he would       28
he said        23
he should      23
he replied     22
dtype: int64
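To put "he" and "she" side by side, one possible approach is to split each bigram and compare its first word exactly, which sidesteps the substring problem without regular expressions. This is a sketch on a made-up list called `toy_bigramlist`, and the name `following` is also made up. It assumes every bigram is exactly two words separated by a space.

```python
from collections import Counter

# Small stand-in list; in the notebook this would be the full bigramlist
toy_bigramlist = ["he was", "she was", "she had", "he said", "she was", "that she", "her mother"]

# One counter per pronoun, tallying the word that comes right after it
following = {"he": Counter(), "she": Counter()}

for bigram in toy_bigramlist:
    first, second = bigram.split()
    if first in following:   # exact comparison of the first word, so 'her' never matches 'he'
        following[first][second] += 1

print(following["she"].most_common(2))  # [('was', 2), ('had', 1)]
print(following["he"].most_common())    # [('was', 1), ('said', 1)]
```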

### Assignment

* Write a for-loop that looks for bigrams, trigrams, up to nine-grams in Jane Austen, by switching out '_' in the 'n = _' parameter of the .ngrams() command.
* Add a line of .value_counts() to the for-loop.  Using the **len()** command, count how many multi-word phrases there are for each iteration, i.e.: how many 9-word phrases are there, total? how many 8-word phrases? how many 7-word phrases? etc. 
* Adjust the for-loop to save the results of **len()** in a list.  HINT: You may need to create a new dummy variable for this count before the for loop.
* What are the longest phrases that are repeated more than 3 times across Austen's corpus? 
* Add to your for-loop a step to stopword each word in your analysis of phrases to make your results more meaningful.  What are the longest 3-word phrases in Jane Austen that don't include stopwords? HINT: Study the section labeled 'Data wrangling diversion' to split up your bigrams and test them individually.

Upload a screenshot of your code and the answer to Canvas.