For citation information, please see the "Source Information" section listed in the associated README file: https://github.com/stephbuon/digital-history/tree/master/hist3368-week1-intro-to-jupyter

# Hist 3368 - Week 2 - Cleaning Text

* lightly amended from Alison Parrish, Lauren Klein, and the Programming Historian William J. Turkel and Adam Crymble, "Counting Word Frequencies with Python," The Programming Historian 1 (2012), https://doi.org/10.46430/phen0003.*

Very often, when working with text, we'll want to "clean" up the results of how the computer first reads them.  What does this mean?  When a computer reads human texts, the following situations frequently occur:

    * The computer thinks that a piece of punctuation like a comma or semicolon is literally part of the word.
    * The computer wants to count singular words (like 'tree') as a different word from the same word in the plural ('trees)
    
We might add one further difficulty:

    * The computer's counts of important words, by default, are meaningless. In almost every human text, the most frequently appearing words are "the," "and," "of," and "in." If asked to count the most frequent words, a human might simply choose to ignore those words.

Thus the computer, by default, counts words quite differently from the way you or I might, if instructed to count words.

To teach the computer how to count words usefully, we need to "clean" the text:

    * We will remove punctuation. This is called "stripping punctuation"
    * We will create a list of "stopwords," generated from either a default list or from counting the most frequent words in a list.  
    
There is also one more major cleaning step that we will teach later, next week:

    * We will normalize plural words ("trees") and past-tense words ("yielded") by removing the suffixes "s" and "ed". This is known as "stemming" or "lemmatizing" depending on the technique.

In this lesson, we will examine basic strategies for cleaning text that will be a part of our work for the rest of the semester.

Throughout the semester, you will return to these cleaning functions again and again. Why? Because no default cleaning cocktail works all the time. Counting words, you may frequently find that some stray piece of punctuation escaped your cleaning, or that the most frequent words counted are meaningless for your analysis.  

Again and again, you will need to choose new stop words. You will clean new punctuation marks.  You will adjust the text. Learning the basic commands for cleaning text will be useful to you for the rest of the semester and indeed whenever you choose to text mine in the future.


# Let's Learn About Cleaning with Sample Text

First, let's generate a sample text.  We'll use two functions we learned about previously -- .split() and pd.Series.value_counts -- to generate a raw word count.

In [5]:
sampletext = "Once upon a time, in a kingdom far, far away, there lived a young woman.  In other times, there had been other women, but this young woman was special."

In [6]:
samplewords = sampletext.split()

In [9]:
import pandas as pd
samplecount = pd.Series.value_counts(samplewords)
samplecount

a           3
young       2
other       2
there       2
times,      1
far         1
away,       1
kingdom     1
far,        1
special.    1
been        1
time,       1
woman.      1
lived       1
women,      1
woman       1
in          1
Once        1
this        1
In          1
was         1
had         1
upon        1
but         1
dtype: int64

Notice that there are some properties of this word count that wouldn't correspond to how a human would have counted the words. 

    * "Far" and "far," are counted as separate words
    * "woman.", "women," and "woman" are counted as separate words.
    * "In" (with a capital I) and "in" (lowercase) are counted separately.
    
Let's explore some commands that will tell Python to count these words in a more intuitive way.

### Lowercasing

A first command to teach Python to treat text more intuitively is called "lowercasing."  Lowercasing normalized capitals and lowercase letters.

In Python, we execute lowercasing with the function *.lower().

We will apply .lower() to the original text, sampletext. 

Then we will use .split() and .value_counts() as we did before.

In [12]:
sampletextlowered = sampletext.lower()
samplewordslowered = sampletextlowered.split()
samplecountlowered = pd.Series.value_counts(samplewordslowered)
samplecountlowered

a           3
young       2
other       2
in          2
there       2
times,      1
far         1
away,       1
kingdom     1
once        1
been        1
time,       1
woman.      1
far,        1
lived       1
women,      1
was         1
this        1
woman       1
special.    1
had         1
upon        1
but         1
dtype: int64

Notice that the count of "in" is now 2 instead of 1.  Notice that there are no longer capitalized words in the count.

### Stripping punctuation

We can use the function .strip() to remove certain characters.


We can pull a list of punctuation from the cloud by importing the "string" package and inspecting the object string.punctuation.

In [27]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

We will use the command .replace() to work on removing punctuation.  

   * Replace only works on one item of punctuation at a time, so we need a "for" loop to cycle through each punctuation mark in string.punctuation
   * Replace takes two arguments: first, the string to find; second, what to replace it with.  
        * We will replace any given punctuation item with an empty string: ""

In [31]:
nopunctuationtext = sampletextlowered
for c in string.punctuation:
    nopunctuationtext = nopunctuationtext.replace(c, "")
    
print(nopunctuationtext)

once upon a time in a kingdom far far away there lived a young woman  in other times there had been other women but this young woman was special


In [None]:
Great! Now we .split() and .value_counts() as before.

In [32]:
nopunctuationwords = nopunctuationtext.split()
nopunctuationcounts = pd.Series.value_counts(nopunctuationwords)
nopunctuationcounts

a          3
other      2
far        2
woman      2
in         2
young      2
there      2
kingdom    1
once       1
time       1
been       1
this       1
times      1
lived      1
was        1
away       1
special    1
women      1
had        1
upon       1
but        1
dtype: int64

Notice that "far," (with a comma) and "far" (without a comma) are now counted together.  There are two of them.

### Stopwords

Very frequently, human analysts don't care about *absolute* most frequent words in a text.  That is because, statistically, those most frequent words are also boring: words such as "the", "an," and "in."

Coders who work with text frequently often compile generic stopwords lists for the English language.  We can reach into the cloud and download one from the software package "nltk" with the following commands:

In [35]:
from nltk.corpus import stopwords
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

Note that we can always expand this list, depending on what we're studying in a text, by using .append() to add new words to this list.

In [41]:
stopwords2 = stopwords.words('english')
stopwords2.append("boringword")

stopwords2[-5:]

['won', "won't", 'wouldn', "wouldn't", 'boringword']

Now we can write a for loop to cycle through the words in nopunctuationwords and keep only the ones that aren't stopwords.

   * First, we create an empty list using square brackets. We're calling it *goodwords*.  Right now there's nothing in the list, but we'll be adding to it, one good word at a time.
   
            goodwords = []
            
       
   * Next, we write a for loop to cycle through each item in the Series nopunctuationwords:
   
           for word in nopunctuationwords:
           
       
   * Next, we write a conditional statement using "if", "in", and "not" to test the condition of whether each word is in the stopwordslist
   
           if word not in stopwords2:
           
           
   * We add the words that meet the condition in the "if" statement by appending those words to our new list, *goodwords*
   
           goodwords.append(word)
           
           

In [45]:
goodwords = []

for word in nopunctuationwords:
    if word not in stopwords2:
        goodwords.append(word)

goodwords

['upon',
 'time',
 'kingdom',
 'far',
 'far',
 'away',
 'lived',
 'young',
 'woman',
 'times',
 'women',
 'young',
 'woman',
 'special']

We can now count these words as before. 

In [46]:
goodwordcounts = pd.Series.value_counts(goodwords)
goodwordcounts

far        2
young      2
woman      2
special    1
lived      1
kingdom    1
upon       1
women      1
time       1
times      1
away       1
dtype: int64

Notice that this list of counted words is *juicier* than the lists before.  It counts only the words that would be used to tell a story -- "young," "woman," "special," "kingdom," etc.  These words give the flavor of the two sentences above.

# Let's count with some real text from the internet.

As before, we are loading text from the internet.  Don't worry about the commands below; we will give you the code to load many texts from the internet in this class. 

Let's load some text and use square brackets -- [:2000] -- to look at the first 2000 characters. 

In [58]:
import urllib.request, urllib.error, urllib.parse, bs4 as bs
source = urllib.request.urlopen('http://www.oldbaileyonline.org/browse.jsp?id=t17800628-33&div=t17800628-33')
soup = bs.BeautifulSoup(source, 'lxml')
text = soup.get_text()
print(text[:2000])



browse - central criminal court














jump to contentjump to main navigationjump to section navigation


the proceedings of the old bailey
london's central criminal court, 1674 to 1913


main navigationhomesearchabout the proceedingshistorical backgrounddatathe projectcontact





 benjamin bowsey. breaking peace: riot. 28th june 1780reference numbert17800628-33verdictguiltysentencedeathrelated material associated recordsactionscite this textold bailey proceedings online (www.oldbaileyonline.org, version 8.0, 30 august 2021), june 1780, trial of                      benjamin                      bowsey                                                                               (t17800628-33).close | print-friendly version | report an errornavigation< previous text (trial account) | next text (trial account) >see original 324.                                                        benjamin                      bowsey                                                           

### Define and Remove Custom Stopwords

If you read the sample text we scraped from the internet, you'll see that there's an eighteenth-century court case. But there's also a lot of gobbledegook from the Internet -- URL's, section headers, etc.  We don't want to count these or read them.

Fortunately, we can use the same procedure as we did for stripping punctuation and stopwording to get rid of the gobbledegook.  

We'll begin by creating a custom stopwords list called "extrawords."

In [72]:
# we can define a set of words that seem to come from the webpage that we don't want to use
extrawords = ["[]", "_gaq.push(['_setaccount'", "browse", "-", "central", "criminal", "court", "var", "_gaq", "=", "_gaq", "||", "[];", 
              "_gaq.push([", "_setaccount", "ua-19174022-1]", "_gaq.push([", "\"_gaq.push(['_setaccount'\"",
              "_trackpageview","(function()", "{", "var", "ga", "=", "document.createelement",
              "'ua-19174022-1'", "_gaq.push(['_trackpageview'",
              "script", "ga.type", "=", "text/javascript", "ga.async", "=", "true;", "ga.src", "=", "https:",
              "==", "document.location.protocol", "?", "https://ssl", ":", "http://www", "+", 
              ".google-analytics.com/ga.js", "var", "s", "=", "document.getelementsbytagname",
              "s.parentnode.insertbefore(ga", "s", "})();", "jump", "to", "contentjump", "to", "main",
              "navigationjump", "section", "navigation",  "proceedings", 
              '(www.oldbaileyonline.org', '8.0', '2020)', '1780',
              "'ua-19174022-1'])", "_gaq.push(['_trackpageview'])", "document.createelement('script')", 
              "'text/javascript'", 'true', "('https:'", "'https://ssl'", '', "'http://www')",
              "'.google-analytics.com/ga.js'", "document.getelementsbytagname('script')[0]", 's)', '})()',
              "bailey",  "1674", "to", "1913", "main", "navigationhomesearchabout",
              "proceedingshistorical", "backgrounddatathe", "projectcontact", "benjamin", "bowsey", 
              "breaking", "peace:", "riot.", "28th", "june", "1780reference", 
              "numbert17800628-33verdictguiltysentencedeathrelated", "material", "associated", 
              "recordsactionscite", "this", "textold", "bailey", "proceedings", "online", 
              '1674-1834', 'api', 'demonstrator', '<!--', 'google_ad_client', 'pub-6166712890256554', 
              '/*', '180x150', 'created', '21/11/08', '*/', 'google_ad_slot', '3829571269', 'google_ad_width',
              '180', 'google_ad_height', '150', '//-->', '<!--', 'google_ad_client', 'pub-6166712890256554', '/*',
              '180x150', 'created', '21/11/08', '*/', 'google_ad_slot', '1983343858', 'google_ad_width', '180', 
              'google_ad_height', '150', '//-->', '<!--', 'google_ad_client', 'pub-6166712890256554', '/*', 
              '180x150', 'created', '21/11/08', '*/', 'google_ad_slot', '9176171409', 'google_ad_width', '180', 
              'google_ad_height', '150', '//-->', 'footer', 'march', '2018', '©', '2003-2018',  
              'www.oldbaileyonline.org', '2020', 't17800628-33).close', "'324'",
              "_gaq.push(['_setaccount", "_gaq.push(['_trackpageview", "document.createelement('script", "document.getelementsbytagname('script')[0",
              "_gaq.push(['_setaccount',", "ua-19174022-1']);", "_gaq.push(['_trackpageview']);", 
              "document.createelement('script');", "text/javascript';", "('https:", "http://www')", 
              ".google-analytics.com/ga.js';", "document.getelementsbytagname('script')[0];", 's.parentnode.insertbefore(ga,', 's);',
               'web', 'site', 'sitemap', 'copyright', '&', 'citation', 'guide', 'visual', 'design', 'technical',
              'design', 'xml', 'feedback', 'ua-19174022-1', "'www.oldbaileyonline.org'", "'2020'", "'t17800628-33).close'", 
              "'ua-19174022-1']", "_gaq.push(['_trackpageview']", 'function', "document.createelement('script'", "'https:'", "'http://www'", '}', 
              "(www.oldbaileyonline.org,", "version", "8.0,", "08", "september", "2020),", "june", "1780,", 
              '"pub-6166712890256554";', '180x150,', '"3829571269";', '180;', '150;', '"pub-6166712890256554";', '180x150,', 
              '"1983343858";', '180;', '150;', '"pub-6166712890256554";', '180x150,', '"9176171409";', '180;', '150;',
              "trial", "of", "benjamin", "bowsey", "(t17800628-33).close", "|", "print-friendly", "version", 
              "|", "report", "errornavigation<", "previous", "text", "(trial", "account)", "|", "next", "recordsactionscite", 
              "text", "(trial", "account)", ">see", "original", "324.", "navigationjump", "contentjump", "browse", "navigationhomesearchabout",
             "proceedingshistorical", "1780reference", "navigationjump", "navigationhomesearchabout", "navigationhomesearchabout", "proceedingshistorical", "backgrounddatathe", "projectcontact",
              "backgrounddatathe", "(trial", "account)", "projectcontact", "(t17800628-33).close", "(www.oldbaileyonline.org,", "numbert17800628-33verdictguiltysentencedeathrelated",
             ]
#print(extrawords)

In [75]:
# we can get rid of the extrawords using a loop

for e in extrawords:
    cleaned = text.replace(e, "")

cleaned =  ' '.join(cleaned.split()) # this command -- which we didn't touch on -- removes extra whitespace.

print(cleaned[:2000])


browse - central criminal court jump to contentjump to main navigationjump to section navigation the proceedings of the old bailey london's central criminal court, 1674 to 1913 main navigationhomesearchabout the proceedingshistorical backgrounddatathe projectcontact benjamin bowsey. breaking peace: riot. 28th june 1780reference material associated recordsactionscite this textold bailey proceedings online (www.oldbaileyonline.org, version 8.0, 30 august 2021), june 1780, trial of benjamin bowsey (t17800628-33).close | print-friendly version | report an errornavigation< previous text (trial account) | next text (trial account) >see original 324. benjamin bowsey (a blackmoor ) was indicted for that he together with five hundred other persons and more, did, unlawfully, riotously, and tumultuously assemble on the 6th of june to the disturbance of the public peace and did begin to demolish and pull down the dwelling house of richard akerman , against the form of the statute, &c. rose jenning

After cleaning, the text looks much better, doesn't it?  

   * There are still a smattering of misplaced matter the beginning of the passage, left over from the webpage heading. 
   * But by and large, you can just read the text like a transcript of the trial at this point, word for word. 
   * Later this week we'll ask you to do just that.  Bear in mind that you can read the whole thing right now with the command 'print(cleaned)' (without the square brackets limiting the viewing box to the first 2000 characters)

#### Lowercase, remove punctuation, and stopword

Now, let's use the tools we saw above.

In [80]:
# remove punctuation
for c in string.punctuation:
    cleaned = cleaned.replace(c, "")

cleaned[:2000]

'browse  central criminal court jump to contentjump to main navigationjump to section navigation the proceedings of the old bailey londons central criminal court 1674 to 1913 main navigationhomesearchabout the proceedingshistorical backgrounddatathe projectcontact benjamin bowsey breaking peace riot 28th june 1780reference material associated recordsactionscite this textold bailey proceedings online wwwoldbaileyonlineorg version 80 30 august 2021 june 1780 trial of benjamin bowsey t1780062833close  printfriendly version  report an errornavigation previous text trial account  next text trial account see original 324 benjamin bowsey a blackmoor  was indicted for that he together with five hundred other persons and more did unlawfully riotously and tumultuously assemble on the 6th of june to the disturbance of the public peace and did begin to demolish and pull down the dwelling house of richard akerman  against the form of the statute c rose jennings  esq sworn had you any occasion to be

In [82]:
# lowercase and split into words
cleanlowercasewords = cleaned.lower().split()
cleanlowercasewords[:20]

['browse',
 'central',
 'criminal',
 'court',
 'jump',
 'to',
 'contentjump',
 'to',
 'main',
 'navigationjump',
 'to',
 'section',
 'navigation',
 'the',
 'proceedings',
 'of',
 'the',
 'old',
 'bailey',
 'londons']

In [84]:
# use the standard stopwords list
stopworded = []

for word in cleanlowercasewords:
    if word not in stopwords2:
        stopworded.append(word)

stopworded[:20]

['browse',
 'central',
 'criminal',
 'court',
 'jump',
 'contentjump',
 'main',
 'navigationjump',
 'section',
 'navigation',
 'proceedings',
 'old',
 'bailey',
 'londons',
 'central',
 'criminal',
 'court',
 '1674',
 '1913',
 'main']

In [86]:
counted = pd.Series.value_counts(stopworded)
counted[:20]

house       23
yes         20
mr          19
prisoner    18
man         16
mob         14
black       13
night       12
saw         11
see         10
akermans    10
know         9
fire         9
june         9
face         9
sworn        9
went         9
believe      8
thing        8
room         8
dtype: int64

#### General advice about cleaning

It's important to check the text as you work. At many times throughout the course of a text-mining project,  good practice requires "inspecting" the data set to see how well a particular operation worked.  In this case, we're checking our work cleaning.

Do we like the resulting list of top words? 

   * To my eye, many of the resulting words are boring.  They just look like regular features of eighteenth-century trials.  I think I'd get more out of this list if I removed "yes" and "mr" from the list.
   
Part of doing a good job cleaning is asking yourself if you like the results, and trying again if necessary.

### Expanding a stopwords list

Let's add two more words to our stopwords list and run the count again.

In [94]:
stopwords3 = stopwords2
stopwords3.append("mr")
stopwords3.append("yes")

In [95]:
stopworded = []

for word in cleanlowercasewords:
    if word not in stopwords3:
        stopworded.append(word)

counted = pd.Series.value_counts(stopworded)
counted[:20]

house       23
prisoner    18
man         16
mob         14
black       13
night       12
saw         11
akermans    10
see         10
face         9
fire         9
june         9
know         9
sworn        9
went         9
time         8
room         8
believe      8
thing        8
lodging      7
dtype: int64

# What did we do?

Next, let's ask some basic questions about what we just did.

How many words did we originally download from the internet?

In [100]:
len(text.split())

2874

How many words were eliminated after after we cleaned them?


In [102]:
len(text.split()) - len(stopworded)

1676