For citation information, please see the "Source Information" section listed in the associated README file: https://github.com/stephbuon/digital-history/tree/master/hist3368-week3-ngrams-lemmatization-gender

# Week 3 Mini Notebook: Lemmatization

This notebook is designed to introduce you to two related cleaning steps, called 'stemming' and 'lemmatization.' Previously, we learned to clean text by lowercasing it, removing punctuation, and removing stopwords.

When scholars count words, they often want to treat singulars and plurals as the same word, so that the following words would be counted together:

    * giraffe
    * giraffes
    
The easy way to fix this problem is called "stemming." In stemming, we tell the computer to look for common endings, like 's,' 'ed,' and 'ing,' and remove them.

However, that process won't work for irregular words in English, for instance the following words:

    * wolf / wolves
    * woman / women

Fortunately, linguists have compiled lists of irregular words and their 'root lemma,' which is to say the 'wolf' in 'wolves.'  The process of 'lemmatization' is looking for irregular words and replacing them with their lemmas.  
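
The difference between the two approaches can be sketched in a few lines of plain Python. This is a toy illustration only: the suffix list, the two-entry `IRREGULAR` table, and the function names `naive_stem` and `naive_lemma` are made up for this example and are not how NLTK works internally.

```python
SUFFIXES = ("ing", "ed", "s")                      # common endings, checked in order
IRREGULAR = {"wolves": "wolf", "women": "woman"}   # tiny hand-made table (assumption)

def naive_stem(word):
    """Strip the first matching suffix, leaving at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def naive_lemma(word):
    """Look the word up in the irregular table, else fall back to stemming."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    return naive_stem(word)

# Print each word with its toy stem and its toy lemma
for word in ["giraffe", "giraffes", "walked", "wolves", "women"]:
    print(word, naive_stem(word), naive_lemma(word))
```

The toy stemmer handles "giraffes" and "walked," but it turns "wolves" into "wolve" and leaves "women" untouched. Only the lookup table gets the irregular words right.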

We will use the **wordnet** module from the **nltk** package, which was designed by linguists, to lemmatize text correctly.

## Download some Jane Austen Novels

In [1]:
import nltk, numpy, re, matplotlib# , num2words

In [2]:
cd /scratch/group/history/hist_3368-jguldi

/scratch/group/history/hist_3368-jguldi


In [3]:
# load each novel and keep only the text between its start and end markers

with open('senseandsensibility.txt', 'r') as myfile:
    sas_data = myfile.read().split('\n\n"I suppose you know, ma\'am, that Mr. Ferrars is married"\n\nIt _was_ Edward\n\n"Everything in such respectable condition"\n\n ')[1].split('THE END')[0].strip()

with open('emma.txt', 'r') as myfile:
    emma_data = myfile.read().split('CHAPTER I')[1].split('FINIS')[0].strip()

with open('prideandprejudice.txt', 'r') as myfile:
    pap_data = myfile.read().split('CHAPTER I')[1].split('End of the Project Gutenberg EBook of Pride and Prejudice, by Jane Austen')[0].strip()


Make sure that your data matches what you think it should.

In [4]:
# printing only first 2000 characters.
sas_data[:2000]

"*       *       *       *       *\n\n\n\n\nCHAPTER I\n\n\nThe family of Dashwood had long been settled in Sussex. Their estate\nwas large, and their residence was at Norland Park, in the centre of\ntheir property, where, for many generations, they had lived in so\nrespectable a manner as to engage the general good opinion of their\nsurrounding acquaintance. The late owner of this estate was a single\nman, who lived to a very advanced age, and who for many years of his\nlife, had a constant companion and housekeeper in his sister. But her\ndeath, which happened ten years before his own, produced a great\nalteration in his home; for to supply her loss, he invited and\nreceived into his house the family of his nephew Mr. Henry Dashwood,\nthe legal inheritor of the Norland estate, and the person to whom he\nintended to bequeath it. In the society of his nephew and niece, and\ntheir children, the old Gentleman's days were comfortably spent. His\nattachment to them all increased. The consta

Looks good!

Isn't it getting tiring, retyping the same command for each novel? Let's throw them all into one list, which we'll call *data*, so we can loop through them.

In [9]:
data = [sas_data, emma_data, pap_data]

Remember that we can call the first item in a list with square brackets.  Here is how you will call Sense and Sensibility:

    data[0]
    
If you only want to see the first 2000 characters of Sense and Sensibility, you call it this way:

    data[0][:2000]

In [5]:
data[0][:2000]

"*       *       *       *       *\n\n\n\n\nCHAPTER I\n\n\nThe family of Dashwood had long been settled in Sussex. Their estate\nwas large, and their residence was at Norland Park, in the centre of\ntheir property, where, for many generations, they had lived in so\nrespectable a manner as to engage the general good opinion of their\nsurrounding acquaintance. The late owner of this estate was a single\nman, who lived to a very advanced age, and who for many years of his\nlife, had a constant companion and housekeeper in his sister. But her\ndeath, which happened ten years before his own, produced a great\nalteration in his home; for to supply her loss, he invited and\nreceived into his house the family of his nephew Mr. Henry Dashwood,\nthe legal inheritor of the Norland estate, and the person to whom he\nintended to bequeath it. In the society of his nephew and niece, and\ntheir children, the old Gentleman's days were comfortably spent. His\nattachment to them all increased. The consta

The text is still broken up by "\n" wherever the original file started a new line.

The characters '\n' are an 'escape sequence,' which is computer speak for 'a new line starts here.' They show up as literal '\n's in the output above because the notebook is displaying the raw string rather than printing it.

Let's get rid of those next, using *.replace*. We'll replace them with a normal space, or ' '.

In [8]:
for i in range(len(data)):
    data[i] = data[i].replace('\n', ' ') 
data[0][:2000]

"*       *       *       *       *     CHAPTER I   The family of Dashwood had long been settled in Sussex. Their estate was large, and their residence was at Norland Park, in the centre of their property, where, for many generations, they had lived in so respectable a manner as to engage the general good opinion of their surrounding acquaintance. The late owner of this estate was a single man, who lived to a very advanced age, and who for many years of his life, had a constant companion and housekeeper in his sister. But her death, which happened ten years before his own, produced a great alteration in his home; for to supply her loss, he invited and received into his house the family of his nephew Mr. Henry Dashwood, the legal inheritor of the Norland estate, and the person to whom he intended to bequeath it. In the society of his nephew and niece, and their children, the old Gentleman's days were comfortably spent. His attachment to them all increased. The constant attention of Mr. a

Next, let's split the text of each novel into individual words using *.split()*.

For each iteration of the loop -- that is, for each novel -- we'll print the first twenty words.
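
The runs of spaces left behind by the replacement step are harmless here, because *.split()* with no argument treats any stretch of whitespace as a single separator. A small sketch on a made-up sample string (not the full novel):

```python
sample = "CHAPTER I\n\n\nThe family of Dashwood had long been settled\nin Sussex."

replaced = sample.replace("\n", " ")   # three newlines become three spaces
words = replaced.split()               # no argument: split on any run of whitespace

print(repr(replaced))
print(words[:5])
```

By contrast, `replaced.split(' ')` would produce empty strings wherever two spaces sit next to each other.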

In [11]:
import pandas

for novel in data:
    words = novel.split()
    print(words[:20]) 

['*', '*', '*', '*', '*', 'CHAPTER', 'I', 'The', 'family', 'of', 'Dashwood', 'had', 'long', 'been', 'settled', 'in', 'Sussex.', 'Their', 'estate', 'was']
['Emma', 'Woodhouse,', 'handsome,', 'clever,', 'and', 'rich,', 'with', 'a', 'comfortable', 'home', 'and', 'happy', 'disposition,', 'seemed', 'to', 'unite', 'some', 'of', 'the', 'best']
['.', 'It', 'is', 'a', 'truth', 'universally', 'acknowledged,', 'that', 'a', 'single', 'man', 'in', 'possession', 'of', 'a', 'good', 'fortune,', 'must', 'be', 'in']


## Cleaning the Novels

Now, let's lowercase the text and get rid of punctuation.

In [12]:
import re

for i in range(len(data)):
    # data[i] is the current novel
    data[i] = data[i].lower() # force to lowercase
    data[i] = re.sub(r'[",.;:?(\[)\]_*]', '', data[i]) # remove punctuation and special characters with a regular expression (raw string avoids escape warnings)
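
To see exactly what this cleaning step removes, it can help to run it on one short sentence first. A minimal sketch, using a made-up sample sentence and a hypothetical constant name `PUNCTUATION`:

```python
import re

# Same set of characters as the cell above, written as a raw string.
PUNCTUATION = r'[",.;:?(\[)\]_*]'

sample = 'It _was_ Edward. "Everything," she said; (truly)? Don\'t stop!'
cleaned = re.sub(PUNCTUATION, '', sample.lower())
print(cleaned)
```

Apostrophes and exclamation marks are not in the character class, so "don't" and "stop!" pass through unchanged. Add those characters to the pattern if you want them removed as well.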


### Stemming

Stemming is the process of removing suffixes, like "ed" or "ing".

We will use a standard NLTK class, PorterStemmer, to do the stemming.



In [24]:
from nltk.stem import PorterStemmer

st = PorterStemmer()

stemmed_list = []

for novel in data:
    words = novel.split()
    for word in words:
        stemmed = st.stem(word)
        stemmed_list.append(stemmed)
        
stemmed_list[:20] # show only the first 20 stems
# printing every word would take far longer than you might expect

['chapter',
 'i',
 'the',
 'famili',
 'of',
 'dashwood',
 'had',
 'long',
 'been',
 'settl',
 'in',
 'sussex',
 'their',
 'estat',
 'wa',
 'larg',
 'and',
 'their',
 'resid',
 'wa']

As we can see, with stemming:

   * "settled" becomes "settl," which means that "settling" and "settler" will be counted together.
   * "residence" becomes "resid," which means it will be counted with "resident" and "residing."

Those counts will work well.


#### What's not so great about Stemming?

Stemming is a quick-and-dirty method, and it produces some questionable results that we might not want to publish:

   * "was" becomes "wa." This won't help us count "was" together with "is" or "be," which are other forms of the same verb.
   * "large" becomes "larg," which means that "large" will be counted with "largely" and potentially "largo," which would be inaccurate.
   * "family" becomes "famili," which is not an English word. "Families" is reduced to the same stem, so the two are counted together, but the label that shows up in our counts is misleading.
   

Stemming, therefore, isn't what we want to use for cleaning text. It would give us inaccurate results.

### Lemmatization

Next, let's turn towards a more robust process.  In 'lemmatization,' the computer has been given a list of irregular words. It looks for them, and reduces every word to its "lemma," or root.

This process is slower and more memory intensive than stemming, but the results are far more accurate.

We will be using lemmatization, not stemming, to clean texts in this class.

First, import the wordnet corpus reader from the nltk.corpus package. (If WordNet has not been downloaded on your machine yet, run nltk.download('wordnet') first.)

In [19]:
from nltk.corpus import wordnet as wn

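The command for "lemmatize" is **wn.morphy()**.

The .morphy() command takes the word that needs to be lemmatized. It also accepts an optional part of speech as a second argument, and it returns None if it cannot find the word in WordNet.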

In [26]:
wn.morphy('aardwolves')

'aardwolf'

Let's write a loop to lemmatize every word in Jane Austen.

In [22]:
lemma_list = []

for novel in data:
    words = novel.split()
    for word in words:
        lemma = wn.morphy(word)
        if not lemma:
            # morphy returned None: the word was not found in WordNet, so skip it
            continue
        lemma_list.append(lemma)

lemma_list[:20]

['chapter',
 'i',
 'family',
 'have',
 'long',
 'be',
 'settle',
 'in',
 'sussex',
 'estate',
 'wa',
 'large',
 'residence',
 'wa',
 'at',
 'park',
 'in',
 'centre',
 'property',
 'many']

Lemmatization is often a more useful approach than stemming because it leverages an understanding of the word itself to convert the word back to its root word. 

Lemmatizing fixes some of the problems we saw with stemming:

   * "settled" becomes "settle"
   * "been" has become "be"
   * "large" is "large"

This is an improvement over stemming in many respects.

**However**

We still need to note some important oddities, which we should be aware of before we publish an interpretation based on lemmatization:

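   * "was" is still replaced by "wa," which is an issue if we care about the verb "to be." This most likely happens because wn.morphy(), when given no part of speech, tries nouns first and treats "was" as the plural of "wa." Passing a part of speech, as in wn.morphy('was', wn.VERB), should return "be" instead.
   * Words that are not in WordNet, such as "that," "the," and "of," come back as None. Our loop skips them, so they disappear from the list entirely.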

If we count lemmas, we should be aware of these potential difficulties so that if we graph them we can eliminate any noise generated by these oddities.  
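
One way to guard against silently losing words is to keep the original word whenever the lookup comes back empty. A minimal sketch: the function name `lemmatize_keep_unknown` is made up for this example, and a small hand-made dictionary stands in for wn.morphy so that the cell runs without WordNet data.

```python
def lemmatize_keep_unknown(words, lookup):
    """Lemmatize each word; keep the original when lookup returns None."""
    lemmas = []
    for word in words:
        lemma = lookup(word)
        lemmas.append(lemma if lemma else word)
    return lemmas

# Hypothetical stand-in for wn.morphy: a dict's .get returns None for unknown words
toy_table = {"settled": "settle", "been": "be", "families": "family"}
toy_lookup = toy_table.get

words = ["that", "families", "had", "been", "settled"]
print(lemmatize_keep_unknown(words, toy_lookup))
```

With NLTK loaded, you could pass wn.morphy itself as the `lookup` argument. Whether to keep or drop unknown words depends on the question you are asking, so treat this as one option rather than the rule.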

Stemming and lemmatization are important because they change which words get counted together, and therefore which words rise to the top of our counts.
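
A toy example makes the effect on counts concrete. The sketch below uses Python's built-in Counter and a hand-made two-entry lemma table (an assumption for illustration, not WordNet) on five made-up tokens:

```python
from collections import Counter

tokens = ["wolf", "wolves", "woman", "women", "wolves"]
toy_lemmas = {"wolves": "wolf", "women": "woman"}  # hand-made table (assumption)

raw_counts = Counter(tokens)
lemma_counts = Counter(toy_lemmas.get(t, t) for t in tokens)

print(raw_counts.most_common())
print(lemma_counts.most_common())
```

Before lemmatizing, "wolves" (2) outranks "wolf" (1). Afterwards, "wolf" is counted three times and "woman" twice.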

In [32]:
import pandas as pd

stemmed_count = pd.Series(stemmed_list).value_counts()
stemmed_count[:20]

to      4214
the     4191
of      3692
and     3543
her     2598
a       2161
i       2014
in      1992
wa      1896
it      1896
she     1629
be      1501
that    1403
for     1282
not     1281
as      1247
you     1239
he      1125
hi      1048
had     1032
dtype: int64

In [33]:
lemma_count = pd.Series(lemma_list).value_counts()
lemma_count[:20]

be      3013
a       2161
i       2014
in      1992
have    1952
it      1896
wa      1896
not     1281
as      1247
he      1125
but      862
at       848
by       761
on       703
do       690
all      674
so       661
say      572
no       570
mrs      534
dtype: int64

### Assignment

1) For lemma_count and stemmed_count, expand the number of words you're looking at to 200 instead of 20.  

   * What differences do you notice? 
   * What oddities should we be aware of?

2) Lemmatize the text of Benjamin Bowsey's trial from last week.  

   * Generate a new word cloud with the lemmatized counts and save it.  
   * Compare the new word cloud with last week's word cloud.  Tell us about 5 differences.

Put your answers and the two word clouds into a Word document and upload it to Canvas.