For citation information, please see the "Source Information" section listed in the associated README file: https://github.com/stephbuon/digital-history/tree/master/hist3368-week3-ngrams-lemmatization-gender

# Week 3 Mini Notebook: Lemmatization

This notebook is designed to introduce you to another cleaning step, called 'stemming' or 'lemmatization.'  Previously, we learned to clean text by lowercasing text, removing punctuation, and removing stopwords. 

When scholars count words, they often want singulars and plurals to be treated as the same word, so that the following words would be counted together:

    * giraffe
    * giraffes
    
The easy way to fix this problem is called "stemming." In stemming, we tell the computer to look for common endings -- like 's,' 'ed,' and 'ing' -- and remove them.
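To make the idea concrete, here is a minimal sketch of suffix stripping in plain Python. The function name `naive_stem` and its three-suffix list are made up for illustration; real stemmers, such as the Porter stemmer used later in this notebook, apply many more rules.

```python
# A toy stemmer: strip a few common endings. Illustration only.
SUFFIXES = ("ing", "ed", "s")  # hypothetical, deliberately short list

def naive_stem(word):
    """Remove the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["giraffe", "giraffes", "walked", "walking"]:
    print(word, "->", naive_stem(word))
```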

However, that process won't work for irregular words in English, for instance the following words:

    * wolf / wolves
    * woman / women

Fortunately, linguists have compiled lists of irregular words and their 'root lemma,' which is to say the 'wolf' in 'wolves.'  The process of 'lemmatization' is looking for irregular words and replacing them with their lemmas.  
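The lookup idea can be sketched with an ordinary dictionary. The tiny `IRREGULAR` table and the function `toy_lemma` below are hypothetical stand-ins; the lists compiled by linguists are far larger.

```python
# Toy lemmatizer: check an exception list first, then fall back to a plural rule.
IRREGULAR = {"wolves": "wolf", "women": "woman"}  # made-up mini exception list

def toy_lemma(word):
    """Return the listed lemma for irregular words; otherwise strip a final 's'."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word

for word in ["wolves", "women", "giraffes", "giraffe"]:
    print(word, "->", toy_lemma(word))
```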

We will use the **wordnet** module from the **nltk** package -- which was designed by linguists -- to lemmatize text correctly.

## Download some Jane Austen Novels

### Install some software

In [17]:
import nltk, numpy, re, matplotlib# , num2words

#### Troubleshooting -- if you have trouble, remember to install all new software according to the following formula (remove the hashtag to run the line).

You only need to run the install command once for each software package on your account.

In [18]:
#!pip install nltk --user

In [19]:
cd /scratch/group/history/hist_3368-jguldi

/scratch/group/history/hist_3368-jguldi


In [20]:
#download some data

with open('senseandsensibility.txt', 'r') as myfile:
    sas_data = myfile.read().split('\n\n"I suppose you know, ma\'am, that Mr. Ferrars is married"\n\nIt _was_ Edward\n\n"Everything in such respectable condition"\n\n ')[1].split('THE END')[0].strip()

with open('emma.txt', 'r') as myfile:
    emma_data = myfile.read().split('CHAPTER I')[1].split('FINIS')[0].strip()

with open('prideandprejudice.txt', 'r') as myfile:
    pap_data = myfile.read().split('CHAPTER I')[1].split('End of the Project Gutenberg EBook of Pride and Prejudice, by Jane Austen')[0].strip()


Make sure that your data matches what you think it should.

In [21]:
# printing only first 2000 characters.
sas_data[:2000]

"*       *       *       *       *\n\n\n\n\nCHAPTER I\n\n\nThe family of Dashwood had long been settled in Sussex. Their estate\nwas large, and their residence was at Norland Park, in the centre of\ntheir property, where, for many generations, they had lived in so\nrespectable a manner as to engage the general good opinion of their\nsurrounding acquaintance. The late owner of this estate was a single\nman, who lived to a very advanced age, and who for many years of his\nlife, had a constant companion and housekeeper in his sister. But her\ndeath, which happened ten years before his own, produced a great\nalteration in his home; for to supply her loss, he invited and\nreceived into his house the family of his nephew Mr. Henry Dashwood,\nthe legal inheritor of the Norland estate, and the person to whom he\nintended to bequeath it. In the society of his nephew and niece, and\ntheir children, the old Gentleman's days were comfortably spent. His\nattachment to them all increased. The consta

Looks good!

Isn't it getting tiring, retyping the same command for each novel? Let's throw them all into one list -- which we'll call *data* -- so we can loop through them.

In [22]:
data = [sas_data, emma_data, pap_data]

Remember that we can call the first item in a list with square brackets.  Here is how you will call Sense and Sensibility:

    data[0]
    
If you only want to see the first 2000 characters of Sense and Sensibility, you call it this way:

    data[0][:2000]

In [23]:
data[0][:2000]

"*       *       *       *       *\n\n\n\n\nCHAPTER I\n\n\nThe family of Dashwood had long been settled in Sussex. Their estate\nwas large, and their residence was at Norland Park, in the centre of\ntheir property, where, for many generations, they had lived in so\nrespectable a manner as to engage the general good opinion of their\nsurrounding acquaintance. The late owner of this estate was a single\nman, who lived to a very advanced age, and who for many years of his\nlife, had a constant companion and housekeeper in his sister. But her\ndeath, which happened ten years before his own, produced a great\nalteration in his home; for to supply her loss, he invited and\nreceived into his house the family of his nephew Mr. Henry Dashwood,\nthe legal inheritor of the Norland estate, and the person to whom he\nintended to bequeath it. In the society of his nephew and niece, and\ntheir children, the old Gentleman's days were comfortably spent. His\nattachment to them all increased. The consta

The text still contains "\n" sequences where the original file had line breaks.

The characters '\n' are an 'escape sequence,' or computer speak for 'start a new line here.'  Each line break in the original text file shows up as a literal '\n' in the output above -- an artifact of how the text was formatted.

Let's get rid of those next, using *.replace*. We'll replace them with a normal space, or ' '.
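Before applying this to the novels, here is a small sketch on a made-up sample string (not taken from the notebook's files). In Python source code, `'\n'` is an escape sequence that stands for a single newline character.

```python
sample = "Their estate\nwas large,\n\n\nand their residence"  # made-up sample

print(len("\n"))  # backslash-n counts as one character, not two
replaced = sample.replace("\n", " ")
print(repr(replaced))  # consecutive newlines leave runs of spaces

# Optional extra step: split on any whitespace and rejoin with single spaces.
collapsed = " ".join(sample.split())
print(repr(collapsed))
```

The leftover runs of spaces are harmless for word counting, because `.split()` with no arguments treats any run of whitespace as a single separator.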

In [24]:
for i in range(len(data)):
    data[i] = data[i].replace('\n', ' ') 
data[0][:2000]

"*       *       *       *       *     CHAPTER I   The family of Dashwood had long been settled in Sussex. Their estate was large, and their residence was at Norland Park, in the centre of their property, where, for many generations, they had lived in so respectable a manner as to engage the general good opinion of their surrounding acquaintance. The late owner of this estate was a single man, who lived to a very advanced age, and who for many years of his life, had a constant companion and housekeeper in his sister. But her death, which happened ten years before his own, produced a great alteration in his home; for to supply her loss, he invited and received into his house the family of his nephew Mr. Henry Dashwood, the legal inheritor of the Norland estate, and the person to whom he intended to bequeath it. In the society of his nephew and niece, and their children, the old Gentleman's days were comfortably spent. His attachment to them all increased. The constant attention of Mr. a

We can also inspect each individual novel.

In [25]:
for novel in data:
    
    print(novel[:200].strip()) # the .strip() command removes whitespace that might prevent this from displaying properly
    print()

*       *       *       *       *     CHAPTER I   The family of Dashwood had long been settled in Sussex. Their estate was large, and their residence was at Norland Park, in the centre of their proper

Emma Woodhouse, handsome, clever, and rich, with a comfortable home and happy disposition, seemed to unite some of the best blessings of existence; and had lived nearly twenty-one years in the world w

.   It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.  However little known the feelings or views of such a man may be on his first



## Cleaning the Novels

Now, let's lowercase the text and get rid of punctuation.

#### Lowercase and strip punctuation

In [26]:
import re

for i in range(len(data)):
    # data[i] is the current novel
    data[i] = data[i].lower() # force to lowercase
    data[i] = re.sub(r'[",.;:?([)\]_*]', '', data[i]) # remove punctuation and special characters with a regular expression (raw string avoids escape warnings)
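
To see what the regular expression in the cell above does and does not remove, here is a sketch that applies the same two steps to a made-up sentence. The helper name `clean_text` is hypothetical.

```python
import re

def clean_text(text):
    """Lowercase the text and delete the same characters as the cell above."""
    text = text.lower()
    return re.sub(r'[",.;:?([)\]_*]', '', text)

sample = '"Oh! Mr. Bennet," said she; _you_ know (surely).'  # made-up sample
print(clean_text(sample))
```

Notice that the exclamation mark survives, because `!` is not in the character set; the same is true of apostrophes and hyphens. Words such as "oh!" will therefore be counted separately from "oh" unless you add those characters to the set.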

Next, let's split the text of each novel into individual words using *.split()*.

Then we will filter out the stopwords.

#### Filter out stopwords

In [27]:
import pandas

from nltk.corpus import stopwords
stopwords = stopwords.words('english')

cleandata = [] # create an empty list of clean novels

for novel in data:
    cleanwords = [] # create a dummy list of cleanwords
    words = novel.split() # split the words of the original novel up into a list of individual words.
    for word in words:
        if word not in stopwords:
            cleanwords.append(word)
    cleandata.append(cleanwords)

cleandata[0][:20]

['chapter',
 'family',
 'dashwood',
 'long',
 'settled',
 'sussex',
 'estate',
 'large',
 'residence',
 'norland',
 'park',
 'centre',
 'property',
 'many',
 'generations',
 'lived',
 'respectable',
 'manner',
 'engage',
 'general']
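
The same filter can be written more compactly as a list comprehension. The sketch below uses a small made-up stopword set so that it runs without downloading anything; in the notebook you would use the NLTK list instead. Converting the stopwords to a `set` is an optional speed-up, since checking membership in a set is faster than in a list.

```python
# Hypothetical mini stopword list, for illustration only.
toy_stopwords = {"the", "of", "had", "been", "in"}

def remove_stopwords(text, stopword_set):
    """Split text on whitespace and keep the words that are not stopwords."""
    return [word for word in text.split() if word not in stopword_set]

sample = "the family of dashwood had long been settled in sussex"
print(remove_stopwords(sample, toy_stopwords))
```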


### Stemming

Stemming is another cleaning process. It makes it possible to count together words that have the same root, for example, "manner" and "manners."

Stemming refers to the computational process of normalizing singular and plural, past and present tense by removing the most common suffixes in English, for instance "ed" or "ing".

We will use a standard **NLTK** package function, **PorterStemmer()**, to do the stemming. 

**PorterStemmer.stem()** takes one argument, the word that needs to be stemmed.  

Note that below we are nicknaming PorterStemmer "st" for short. So the command will be **st.stem()** with the word to be stemmed as the argument.

In [28]:
from nltk.stem import PorterStemmer
st = PorterStemmer()

In [30]:
st.stem('longs')

'long'

In [31]:
st.stem('daughters')

'daughter'

Let's try applying stemming to *cleandata* with a for loop and inspect the results.

In [32]:
stemmed_list = [] # create an empty list that we will fill in with stemmed words

for novel in cleandata: # move through each novel in the list *cleandata*
    for word in novel: # move through each word in the novel
        stemmed = st.stem(word) # stem that word
        stemmed_list.append(stemmed) # save the stemmed word to the new list, stemmed_list
        
stemmed_list[:20] 

['chapter',
 'famili',
 'dashwood',
 'long',
 'settl',
 'sussex',
 'estat',
 'larg',
 'resid',
 'norland',
 'park',
 'centr',
 'properti',
 'mani',
 'gener',
 'live',
 'respect',
 'manner',
 'engag',
 'gener']

As we can see, with stemming:

   * "settled" becomes "settl," which means that it will be counted together with "settle," "settles," and "settling."
   * "residence" becomes "resid," which means it will be counted with "resident" and "residing."

Those counts will work well.
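
One way to check which words a stemmer merges is to group the original words by their stem. The helper `group_by_stem` below is a hypothetical sketch; it accepts any stemming function, and the stand-in `strip_s` is used here only so the example runs without NLTK. In the notebook you could pass `st.stem` instead.

```python
from collections import defaultdict

def group_by_stem(words, stem_fn):
    """Map each stem to the set of original words that produced it."""
    groups = defaultdict(set)
    for word in words:
        groups[stem_fn(word)].add(word)
    return dict(groups)

def strip_s(word):
    """Stand-in stemmer: drop one trailing 's'. Not the Porter algorithm."""
    return word[:-1] if word.endswith("s") else word

words = ["manner", "manners", "daughter", "daughters", "estate"]  # made-up sample
for stem, originals in group_by_stem(words, strip_s).items():
    print(stem, sorted(originals))
```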


#### What's not so great about Stemming?

Stemming is a quick-and-dirty method. But it gives us strange results that we might not want to publish.

In [33]:
st.stem('women')

'women'

In [34]:
st.stem('families')

'famili'


Some of these adjustments are questionable:

   * "women" stays "women," which means it won't be counted with "woman."
   * "was" becomes "wa." This won't help us to count "was" with "is" or "to be," which are other forms of the same verb.
   * "large" becomes "larg" (which means that "large" will be counted with "largely" and potentially with unrelated words that share the stem, which would be inaccurate).
   * "families" becomes "famili." It is counted together with "family," which also becomes "famili," but "famili" is not a word we would want to show in a published chart.
   

Stemming, therefore, isn't what we want to use for cleaning text. It would give us inaccurate results.

### Lemmatization

Next, let's turn towards a more robust process.  In 'lemmatization,' the computer has been given a list of irregular words. It looks for them, and reduces every word to its "lemma," or root.

This process is slower and uses more memory than stemming, but the results are far more accurate.  

We will be using lemmatization, not stemming, to clean texts in this class.

First, import wordnet from the nltk.corpus package and download its data:

In [35]:
from nltk.corpus import wordnet as wn
nltk.download('wordnet') # use straight quotes; curly quotes cause a SyntaxError

The command for "lemmatize" is **wn.morphy()**.

The .morphy() command takes one argument: the word that needs to be lemmatized.

In [None]:
wn.morphy('families')

In [None]:
wn.morphy('women')

In [None]:
wn.morphy('aardwolves')

Let's write a loop to lemmatize every word in Jane Austen.

In [None]:
lemma_list = []

for novel in cleandata:
    for word in novel:
        lemma = wn.morphy(word)
        if not lemma:
            # wn.morphy() returns None for words it does not recognize, so skip them
            continue
        lemma_list.append(lemma)

lemma_list[:20]

Lemmatization is often a more useful approach than stemming because it leverages an understanding of the word itself to convert the word back to its root word. 

Lemmatizing fixes some of the problems we saw with stemming:

   * "settled" becomes "settle"
   * "been" has become "be"
   * "large" is "large"

This is an improvement over stemming in many respects.

**However**

We still need to note some important oddities, which we should be aware of before we publish an interpretation based on lemmatization:

   * Words such as "was" are still replaced by "wa," which is an issue if we care about the verb "to be."  
   * **wn.morphy()** returns "None" for words it does not recognize, such as "that." The loop above skips those words; code that does not check for "None" will carry it into the word list.

In fact, there are ways that we can improve this process and correct these errors -- as we shall see in a few weeks when we use the full repertoire of tools to detect the grammatical significance of each word in context.  

For now, quick lemmatization gets us better results than stemming.  It still produces a few errors that we would not want to publish. 
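
One simple safeguard, sketched below, is to fall back to the original word whenever the lemmatizer returns `None`, instead of dropping the word or counting `None`. The function `lemma_or_word` is hypothetical, and `toy_morphy` is a made-up stand-in that mimics only the return-`None`-when-unknown behavior of `wn.morphy()`; in the notebook you would pass `wn.morphy` itself.

```python
def lemma_or_word(word, lemmatize):
    """Return the lemma if the lemmatizer finds one, else the word unchanged."""
    lemma = lemmatize(word)
    return lemma if lemma is not None else word

# Stand-in lookup table (an assumption for illustration, not WordNet data).
TOY_LEMMAS = {"families": "family", "women": "woman", "settled": "settle"}

def toy_morphy(word):
    """Mimic a lemmatizer that returns None for words it does not know."""
    return TOY_LEMMAS.get(word)

words = ["families", "women", "that", "settled"]  # made-up sample
print([lemma_or_word(w, toy_morphy) for w in words])
```

Whether to keep or drop unrecognized words is a judgment call. Keeping them may preserve proper names, such as the names of characters; dropping them, as the loop above does, removes those words from the counts.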

### The Analyst's Role

The analyst's role is to understand the strengths and weaknesses of every approach they use in text mining.

If we use the technique above to find lemmas, we should be aware of these potential difficulties so that we can eliminate any noise generated by oddities.  

For instance, you would not want to publish a graph showing "wa" or "None."  Those are merely data errors.  Nor would you want to write about them in a paragraph.
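
A minimal sketch of that clean-up step, using Python's built-in `Counter` and a made-up word list. The name `ODDITIES` and its contents are assumptions for illustration; you would decide what belongs there after inspecting your own results.

```python
from collections import Counter

ODDITIES = {"wa", None}  # hypothetical list of known data errors

def count_without_oddities(lemmas, oddities=ODDITIES):
    """Count lemmas, leaving out entries flagged as data errors."""
    return Counter(lemma for lemma in lemmas if lemma not in oddities)

lemmas = ["say", "wa", "know", None, "say", "wa", "sister"]  # made-up sample
counts = count_without_oddities(lemmas)
print(counts.most_common(3))
```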

Occasionally, the assignments for these notebooks will ask you to think about why one approach is better than another. 

The purpose of thinking about how we clean is to produce better results -- word counts that don't have data-generated errors, graphs that don't have results that you can't explain, and interpretive paragraphs that make sense.  

## Counting with stemming and lemmatization

Understanding the difference between stemming and lemmatization is important because the choice of technique affects the result of counting.

Let's compare the counts generated by working with *stemmed_list* and *lemma_list*, the two lists of cleaned words that we generated above.

In [None]:
import pandas as pd

stemmed_count = pd.Series(stemmed_list).value_counts() # turn the list into a Series, then count each word
stemmed_count[:20]

In [None]:
lemma_count = pd.Series(lemma_list).value_counts() # turn the list into a Series, then count each word
lemma_count[:20]
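
To compare the two rankings directly, one option is to look at which words appear near the top of one list but not the other. The sketch below uses `collections.Counter` and two short made-up lists rather than the notebook's *stemmed_list* and *lemma_list*; the function name `top_words` is hypothetical.

```python
from collections import Counter

def top_words(words, n):
    """Return the set of the n most frequent items in a list of words."""
    return {word for word, _ in Counter(words).most_common(n)}

# Made-up stand-ins for stemmed_list and lemma_list.
toy_stems = ["everi", "everi", "famili", "famili", "one", "one", "much"]
toy_lemmas = ["every", "every", "family", "family", "one", "one", "much"]

stem_top = top_words(toy_stems, 3)
lemma_top = top_words(toy_lemmas, 3)

print("only in stems: ", sorted(stem_top - lemma_top))
print("only in lemmas:", sorted(lemma_top - stem_top))
print("in both:       ", sorted(stem_top & lemma_top))
```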

#### Summing up

Here's all the code we learned in the notebook today, applied to the novel _Sense and Sensibility_.

You should be able to understand what each line of code does.

In [None]:
import pandas as pd
import nltk, numpy, re, matplotlib# , num2words
from nltk.stem import PorterStemmer
st = PorterStemmer()
from nltk.corpus import stopwords
stopwords = stopwords.words('english')
from nltk.corpus import wordnet as wn
nltk.download('wordnet') 

In [41]:
# choose just one novel from the data
novel = data[0] 

# replace line breaks, lowercase, and strip punctuation
novel = novel.replace('\n', ' ') # replace line breaks with spaces (the result must be saved back to novel)
novel = novel.lower() # force to lowercase
novel = re.sub(r'[",.;:?([)\]_*]', '', novel) # remove punctuation and special characters with a regular expression

# split the novel into words, remove stopwords, then lemmatize and stem each remaining word
lemmas = [] # create an empty list of lemmas
stems = [] # create an empty list of stems
words = novel.split() # split the novel into a list of individual words
for word in words:
    if word not in stopwords: # filter out stopwords
        lemma = wn.morphy(word) # lemmatize each non-stopword
        stem = st.stem(word) # stem that word
        if lemma: # wn.morphy() returns None for words it does not recognize
            lemmas.append(lemma) # save the lemma for use later
        stems.append(stem) # save the stem for use later

# analyze your work
print("### Here are the top lemmatized words")
print(pd.Series(lemmas).value_counts()[:10]) # count all the clean results & show only the top results
print("")

print("### Here are the top word stems")
print(pd.Series(stems).value_counts()[:10]) # count all the clean results & show only the top results

### Here are the top lemmatized words
say       550
mrs       531
every     373
know      364
one       318
much      286
make      282
must      282
sister    259
time      238
dtype: int64

### Here are the top word stems
mr         710
elinor     614
could      575
would      512
mariann    486
said       380
everi      373
one        318
much       286
must       282
dtype: int64


### Assignment

1) For lemma_count and stemmed_count, expand the number of words you're looking at to 200 instead of 20.  Hint: you might have to use a for loop with 'print' to get the computer to display so many results.

   * What differences do you notice between the two methods of shortening words? 
   * What oddities should we be aware of that might affect the outcome of a wordcount analysis?

2) Lemmatize the text of Benjamin Bowsey's trial from last week.  

   * Generate a list of top lemmatized words from the Bowsey trial.
   * Inspect the list of lemmas.  Remove any oddities or data errors from the lemma list before you visualize it.
   * Create a new word cloud with the clean lemmas.  Save it.
   * Compare the new word cloud with last week's word cloud.  Tell us about 5 differences between the two charts.

3) Make three bar plots, one for each of the three Austen novels. 

   * Show the top twenty lemmas, stopworded and stripped of punctuation, from each novel. 
   * Title and label your bar plots appropriately. Give each one a figure name (Figure 1, etc.).
   * Write a paragraph of at least three sentences about what you observe to be the differences and similarities between the novels based on word count of top lemmas. Reference the figures by name in your paragraph.  

Put your answers and the two word clouds into a Word document and upload it to Canvas.