For citation information, please see the "Source Information" section listed in the associated README file: https://github.com/stephbuon/digital-history/tree/master/hist3368-week3-controlled-vocab

# Week 3 Mini Notebook: Lemmatization

## Load Some Jane Austen Novels

In [1]:
import nltk, numpy, re, matplotlib  # , num2words

In [2]:
cd /scratch/group/history/hist_3368-jguldi

/scratch/group/history/hist_3368-jguldi


In [3]:
# load each novel from disk and trim the front and back matter

with open('senseandsensibility.txt', 'r') as myfile:
    sas_data = myfile.read().split('\n\n"I suppose you know, ma\'am, that Mr. Ferrars is married"\n\nIt _was_ Edward\n\n"Everything in such respectable condition"\n\n ')[1].split('THE END')[0].strip()

with open('emma.txt', 'r') as myfile:
    emma_data = myfile.read().split('CHAPTER I')[1].split('FINIS')[0].strip()

with open('prideandprejudice.txt', 'r') as myfile:
    pap_data = myfile.read().split('CHAPTER I')[1].split('End of the Project Gutenberg EBook of Pride and Prejudice, by Jane Austen')[0].strip()
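
The `split(...)[1]` pattern above raises an `IndexError` if a marker is missing from the file. As a minimal sketch, assuming a hypothetical helper named `extract_body` (it is not part of the original notebook), the same trimming can be written so that a missing marker produces a clear error message. It also behaves differently from `split('CHAPTER I')[1]` in one respect: `split` cuts at every occurrence of the marker, and `'CHAPTER I'` is also a prefix of headings such as `'CHAPTER II'`, so the original expression may keep only the text up to the next such heading, whereas this helper keeps everything up to the end marker.

```python
def extract_body(text, start_marker, end_marker):
    """Return the text between the first start_marker and the next end_marker.

    Hypothetical helper for illustration; the marker strings are whatever
    delimits the novel in your copy of the file.
    """
    before, found_start, rest = text.partition(start_marker)
    if not found_start:
        raise ValueError(f"start marker not found: {start_marker!r}")
    body, found_end, after = rest.partition(end_marker)
    if not found_end:
        raise ValueError(f"end marker not found: {end_marker!r}")
    return body.strip()


# Toy string standing in for the contents of a downloaded text file.
sample = "License header CHAPTER I It is a truth universally acknowledged. FINIS License footer"
print(extract_body(sample, "CHAPTER I", "FINIS"))
```

Whether you want one chapter or the whole novel depends on the analysis, so check the length of each string after loading.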


Make sure that your data matches what you think it should.

In [4]:
# printing only first 2000 characters.
sas_data[:2000]

"*       *       *       *       *\n\n\n\n\nCHAPTER I\n\n\nThe family of Dashwood had long been settled in Sussex. Their estate\nwas large, and their residence was at Norland Park, in the centre of\ntheir property, where, for many generations, they had lived in so\nrespectable a manner as to engage the general good opinion of their\nsurrounding acquaintance. The late owner of this estate was a single\nman, who lived to a very advanced age, and who for many years of his\nlife, had a constant companion and housekeeper in his sister. But her\ndeath, which happened ten years before his own, produced a great\nalteration in his home; for to supply her loss, he invited and\nreceived into his house the family of his nephew Mr. Henry Dashwood,\nthe legal inheritor of the Norland estate, and the person to whom he\nintended to bequeath it. In the society of his nephew and niece, and\ntheir children, the old Gentleman's days were comfortably spent. His\nattachment to them all increased. The consta

Looks good!

Isn't it getting tiresome, retyping the same command for each novel? Let's put them all into one list so we can loop through them.


In [5]:

data = [sas_data, emma_data, pap_data]
data[0][:2000]

"*       *       *       *       *\n\n\n\n\nCHAPTER I\n\n\nThe family of Dashwood had long been settled in Sussex. Their estate\nwas large, and their residence was at Norland Park, in the centre of\ntheir property, where, for many generations, they had lived in so\nrespectable a manner as to engage the general good opinion of their\nsurrounding acquaintance. The late owner of this estate was a single\nman, who lived to a very advanced age, and who for many years of his\nlife, had a constant companion and housekeeper in his sister. But her\ndeath, which happened ten years before his own, produced a great\nalteration in his home; for to supply her loss, he invited and\nreceived into his house the family of his nephew Mr. Henry Dashwood,\nthe legal inheritor of the Norland estate, and the person to whom he\nintended to bequeath it. In the society of his nephew and niece, and\ntheir children, the old Gentleman's days were comfortably spent. His\nattachment to them all increased. The consta

The text still contains newline characters ("\n") wherever the original file wrapped a line or left a blank line. We'll replace each of them with a space in a second.

In [6]:
for i in range(len(data)):
    data[i] = data[i].replace('\n', ' ')
data[0][:2000]

"*       *       *       *       *     CHAPTER I   The family of Dashwood had long been settled in Sussex. Their estate was large, and their residence was at Norland Park, in the centre of their property, where, for many generations, they had lived in so respectable a manner as to engage the general good opinion of their surrounding acquaintance. The late owner of this estate was a single man, who lived to a very advanced age, and who for many years of his life, had a constant companion and housekeeper in his sister. But her death, which happened ten years before his own, produced a great alteration in his home; for to supply her loss, he invited and received into his house the family of his nephew Mr. Henry Dashwood, the legal inheritor of the Norland estate, and the person to whom he intended to bequeath it. In the society of his nephew and niece, and their children, the old Gentleman's days were comfortably spent. His attachment to them all increased. The constant attention of Mr. a

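Replacing each newline with a single space leaves runs of several spaces behind, as the output above shows. These do no harm once the text is passed through `split()`, but if you want tidy text for display, one common approach is to collapse every run of whitespace into one space. A minimal sketch on a toy string (the variable names `raw` and `tidy` are illustrative):

```python
raw = "*       *       *\n\n\nCHAPTER I\n\n\nThe family of Dashwood had long\nbeen settled in Sussex."

# str.split() with no arguments splits on any run of whitespace
# (spaces, tabs, newlines), so joining the pieces with one space
# collapses every run into a single space.
tidy = " ".join(raw.split())
print(tidy)
```
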
Next, let's split the text into words and print the first 20 words of each novel.

In [7]:
for novel in data:
    words = novel.split()  # split on any run of whitespace
    print(words[:20])  # first 20 tokens of this novel


['*', '*', '*', '*', '*', 'CHAPTER', 'I', 'The', 'family', 'of', 'Dashwood', 'had', 'long', 'been', 'settled', 'in', 'Sussex.', 'Their', 'estate', 'was']
['Emma', 'Woodhouse,', 'handsome,', 'clever,', 'and', 'rich,', 'with', 'a', 'comfortable', 'home', 'and', 'happy', 'disposition,', 'seemed', 'to', 'unite', 'some', 'of', 'the', 'best']
['.', 'It', 'is', 'a', 'truth', 'universally', 'acknowledged,', 'that', 'a', 'single', 'man', 'in', 'possession', 'of', 'a', 'good', 'fortune,', 'must', 'be', 'in']


## Cleaning the Novels

Now, let's lowercase the text and get rid of punctuation.

In [8]:
import re

for i in range(len(data)):
    # data[i] is the current novel
    data[i] = data[i].lower() # force to lowercase
    data[i] = re.sub(r'[",.;:?(\[)\]_*]', '', data[i])  # remove the listed punctuation and special characters (raw string avoids invalid-escape warnings)
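
To see what the regular expression does, it can help to run it on a single line first. The sketch below applies the same character class, written as a raw string, to a made-up sample line. Note that characters not listed in the class, such as `!`, `'`, and `-`, are left in place.

```python
import re

# Same set of characters as in the cell above: " , . ; : ? ( [ ) ] _ *
PUNCT = re.compile(r'[",.;:?(\[)\]_*]')

# Made-up sample line for illustration.
sample = 'It _was_ Edward. "Oh!" she cried; (it could not be helped).'
cleaned = PUNCT.sub('', sample.lower())
print(cleaned)
```

If the exclamation mark or apostrophes matter for your counts, add them to the character class.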


### Stemming

Stemming is the process of removing suffixes, like "ed" or "ing", so that related word forms are reduced to a shared stem.

We will use NLTK's PorterStemmer class to do the stemming.



In [9]:
from nltk.stem import PorterStemmer

st = PorterStemmer()

stemmed_list = []

for novel in data:
    words = novel.split()
    for word in words:
        stemmed = st.stem(word)
        stemmed_list.append(stemmed)
        
stemmed_list[:20]  # show only the first 20 stems
# printing every stem is far more computationally intensive than it may seem

['chapter',
 'i',
 'the',
 'famili',
 'of',
 'dashwood',
 'had',
 'long',
 'been',
 'settl',
 'in',
 'sussex',
 'their',
 'estat',
 'wa',
 'larg',
 'and',
 'their',
 'resid',
 'wa']

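As we can see, stems are not always real words: "family" becomes "famili", "settled" becomes "settl", "estate" becomes "estat", and "was" becomes "wa". The same happens elsewhere in the text: "universal" becomes "univers" (which means that "universally" will be counted with "universal" and "universe") and "single" becomes "singl" (which means it would be counted with "singled").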
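
One way to see the effect on counting is to group surface forms by their stem. The sketch below is self-contained, so it uses a deliberately crude suffix-stripping function, `crude_stem`, as a stand-in. It is a hypothetical illustration and not the Porter algorithm; in the notebook you would pass `st.stem` to `group_by_stem` instead.

```python
from collections import defaultdict


def crude_stem(word):
    """Strip one common suffix. Hypothetical stand-in, NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words like "is" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def group_by_stem(words, stem):
    """Map each stem to the sorted list of surface forms that produced it."""
    groups = defaultdict(set)
    for word in words:
        groups[stem(word)].add(word)
    return {s: sorted(forms) for s, forms in groups.items()}


words = ["walk", "walked", "walking", "walks", "daughter", "daughters"]
for stem, forms in group_by_stem(words, crude_stem).items():
    print(stem, forms)
```

Every form listed under one stem will be added to the same count, which is the behavior to keep in mind when reading frequency tables built from stems.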

### Lemmatization

Let's look at a related technique, lemmatization, which is more memory intensive than stemming but usually more accurate.

In [10]:
from nltk.corpus import wordnet as wn

wn.morphy('aardwolves')

'aardwolf'

In [11]:
lemma_list = []

for novel in data:
    words = novel.split()
    for word in words:
        lemma = wn.morphy(word)
        if not lemma:
            # wn.morphy found no WordNet entry for this word, so skip it
            continue
        lemma_list.append(lemma)

lemma_list[:20]

['chapter',
 'i',
 'family',
 'have',
 'long',
 'be',
 'settle',
 'in',
 'sussex',
 'estate',
 'wa',
 'large',
 'residence',
 'wa',
 'at',
 'park',
 'in',
 'centre',
 'property',
 'many']

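Lemmatization is often a more useful approach than stemming because it looks each word up in a dictionary (here, WordNet) and returns a real root word rather than a truncated stem. "Acknowledged" becomes "acknowledge," and "daughters" becomes "daughter."

Note some important oddities: `wn.morphy` returns `None` for words that WordNet does not contain, such as "that" and "the". The loop above skips them, which is why they are missing from the list; if you keep them instead, remove the `None` values before counting or graphing lemmas. Not every result is right, either: "was" comes out as "wa" in the output above.

Stemming and lemmatization are important because they determine which word forms are counted together.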
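
The filtering step can be sketched as follows. To keep the example runnable without WordNet installed, it uses a small hypothetical lookup table, `TOY_LEMMAS`, in place of `wn.morphy`. Like `wn.morphy`, the lookup returns `None` for words it does not know, and those are dropped before counting.

```python
from collections import Counter

# Hypothetical lookup table standing in for wn.morphy; real WordNet
# coverage is far larger.
TOY_LEMMAS = {
    "daughters": "daughter",
    "daughter": "daughter",
    "acknowledged": "acknowledge",
    "acknowledge": "acknowledge",
}


def toy_lemma(word):
    """Return the lemma, or None when the word is not in the table."""
    return TOY_LEMMAS.get(word)


tokens = ["that", "daughters", "acknowledged", "the", "daughter", "that"]

# Drop tokens whose lookup returned None before counting.
lemmas = [lemma for lemma in map(toy_lemma, tokens) if lemma is not None]
counts = Counter(lemmas)
print(counts.most_common())
```

Here "daughters" and "daughter" are counted together, while "that" and "the" never reach the counter.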

In [12]:
from collections import Counter


count = Counter(stemmed_list)
print(count.most_common(100))

[('to', 4214), ('the', 4191), ('of', 3692), ('and', 3543), ('her', 2598), ('a', 2161), ('i', 2014), ('in', 1992), ('wa', 1896), ('it', 1896), ('she', 1629), ('be', 1501), ('that', 1403), ('for', 1282), ('not', 1281), ('as', 1247), ('you', 1239), ('he', 1125), ('hi', 1048), ('had', 1032), ('with', 1010), ('have', 920), ('but', 862), ('at', 848), ('is', 780), ('by', 761), ('mr', 758), ('on', 703), ('all', 674), ('so', 661), ('him', 649), ('my', 638), ('elinor', 614), ('which', 600), ('could', 588), ('no', 570), ('from', 554), ('would', 527), ('veri', 525), ('they', 524), ('their', 506), ('mariann', 486), ('them', 484), ('been', 454), ('were', 451), ('what', 443), ('thi', 442), ('me', 429), ('more', 414), ('ani', 409), ('your', 407), ('said', 393), ('everi', 388), ('will', 385), ('such', 373), ('than', 372), ('do', 368), ('or', 360), ('an', 347), ('one', 333), ('when', 317), ('must', 305), ('if', 303), ('much', 301), ('onli', 299), ('own', 284), ('know', 282), ('who', 276), ('time', 264),