For citation information, please see the "Source Information" section listed in the associated README file: https://github.com/stephbuon/digital-history/tree/master/hist3368-week1-intro-to-jupyter

# COUNTING WORDS IN PYTHON

Lightly amended from Alison Parrish, Lauren Klein, and the Programming Historian: William J. Turkel and Adam Crymble, "Counting Word Frequencies with Python," The Programming Historian 1 (2012), https://doi.org/10.46430/phen0003.


In [25]:
# let's play with strings
wordstring = 'it was the best of times it was the worst of times '
wordstring += 'it was the age of wisdom it was the age of foolishness'
print(wordstring)

it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness


In [26]:
wordlist = wordstring.split()
print(wordlist)

['it', 'was', 'the', 'best', 'of', 'times', 'it', 'was', 'the', 'worst', 'of', 'times', 'it', 'was', 'the', 'age', 'of', 'wisdom', 'it', 'was', 'the', 'age', 'of', 'foolishness']


Counting how many times something occurs is a very common task in programming, so Python includes a special kind of object—a Counter—to make the task easier.

To use the Counter object, you need to import it from the collections module.

In [27]:
from collections import Counter

Let's say that you wanted to know how many times each word occurred. To do this in Python, just pass the list of words to the Counter object like so:


In [28]:
count = Counter(wordlist)
count

Counter({'it': 4,
         'was': 4,
         'the': 4,
         'best': 1,
         'of': 4,
         'times': 2,
         'worst': 1,
         'age': 2,
         'wisdom': 1,
         'foolishness': 1})

Here we've assigned a Counter object to the variable count (although you can use whatever variable name you want, of course). Just evaluating the count object shows the counts for each of the items in the list, telling us (for example) that the word 'it' occurs four times, the word 'times' occurs twice, and so forth.

You can get the count for a particular item in the Counter by using indexing syntax as though it were a dictionary:

In [29]:
count["times"]

2
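
One detail worth knowing about this dictionary-style lookup: a Counter reports zero for an item it has never seen, where a plain dictionary would raise a KeyError. The short sketch below illustrates this on a shortened version of the sample sentence; the word 'dickens' is just an arbitrary example of a word that is not in the list.

```python
from collections import Counter

# A shortened version of the sample sentence used above
wordlist = 'it was the best of times it was the worst of times'.split()
count = Counter(wordlist)

print(count['times'])      # 2
print(count['dickens'])    # 0: a missing word has a count of zero, not a KeyError
print('dickens' in count)  # False: looking up a missing word does not add it
```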

The .most_common() method returns a list of tuples of the most common items in the counter, in descending order by number:


In [30]:
count.most_common()

[('it', 4),
 ('was', 4),
 ('the', 4),
 ('of', 4),
 ('times', 2),
 ('age', 2),
 ('best', 1),
 ('worst', 1),
 ('wisdom', 1),
 ('foolishness', 1)]

# Let's count with some real text from the internet.

In [31]:
# Import some text from online.  This isn't a step that you'll need to repeat -- since you'll be
# getting pre-packaged data in a Jupyter notebook in a few weeks -- but you can run these commands
# to "scrape" text off of the internet.
import urllib.request, urllib.error, urllib.parse, bs4 as bs
source = urllib.request.urlopen('http://www.oldbaileyonline.org/browse.jsp?id=t17800628-33&div=t17800628-33')
soup = bs.BeautifulSoup(source, 'lxml')
print(soup)

# What does the BeautifulSoup package do?  What functions were involved?

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Browse - Central Criminal Court</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<link href="css/screen.css" media="screen" rel="stylesheet" type="text/css"/>
<!--<link rel="stylesheet" type="text/css" media="screen" href="a.css" />-->
<!--[if (lte IE 7)&(gte IE 5.5)]><link rel="stylesheet" type="text/css" media="screen" href="css/ie7.css" /><![endif]-->
<!--[if (lte IE 6)&(gte IE 5.5)]><link rel="stylesheet" type="text/css" media="screen" href="css/ie6.css" /><![endif]-->
<!--[if (lt IE 6)&(gte IE 5.5)]><link rel="stylesheet" type="text/css" media="screen" href="css/ie5.css" /><![endif]-->
<link href="css/print.css" media="print" rel="stylesheet" type="text/css"/>
<script src="a.js" type="text/javascript"></script>
<script src="dp.js" type="text/javascript"></script>
<

In [32]:
# What have we mined?
print(soup.title)

<title>Browse - Central Criminal Court</title>


In [33]:
# format the text: pull the visible text out of the HTML and make it lowercase
text = soup.get_text().lower() 
print(text)



browse - central criminal court














jump to contentjump to main navigationjump to section navigation


the proceedings of the old bailey
london's central criminal court, 1674 to 1913


main navigationhomesearchabout the proceedingshistorical backgrounddatathe projectcontact





 benjamin bowsey. breaking peace: riot. 28th june 1780reference numbert17800628-33verdictguiltysentencedeathrelated material associated recordsactionscite this textold bailey proceedings online (www.oldbaileyonline.org, version 8.0, 17 august 2021), june 1780, trial of                      benjamin                      bowsey                                                                               (t17800628-33).close | print-friendly version | report an errornavigation< previous text (trial account) | next text (trial account) >see original 324.                                                        benjamin                      bowsey                                                           

In [34]:
# We can just split it into words like so
words = text.split() # split it up into words
print(words)

['browse', '-', 'central', 'criminal', 'court', 'jump', 'to', 'contentjump', 'to', 'main', 'navigationjump', 'to', 'section', 'navigation', 'the', 'proceedings', 'of', 'the', 'old', 'bailey', "london's", 'central', 'criminal', 'court,', '1674', 'to', '1913', 'main', 'navigationhomesearchabout', 'the', 'proceedingshistorical', 'backgrounddatathe', 'projectcontact', 'benjamin', 'bowsey.', 'breaking', 'peace:', 'riot.', '28th', 'june', '1780reference', 'numbert17800628-33verdictguiltysentencedeathrelated', 'material', 'associated', 'recordsactionscite', 'this', 'textold', 'bailey', 'proceedings', 'online', '(www.oldbaileyonline.org,', 'version', '8.0,', '17', 'august', '2021),', 'june', '1780,', 'trial', 'of', 'benjamin', 'bowsey', '(t17800628-33).close', '|', 'print-friendly', 'version', '|', 'report', 'an', 'errornavigation<', 'previous', 'text', '(trial', 'account)', '|', 'next', 'text', '(trial', 'account)', '>see', 'original', '324.', 'benjamin', 'bowsey', '(a', 'blackmoor', ')', 'wa

In [35]:
# we can define a set of words that seem to come from the webpage that we don't want to use
extrawords = ["[]", "_gaq.push(['_setaccount'", "browse", "-", "central", "criminal", "court", "var", "_gaq", "=", "_gaq", "||", "[];", 
              "_gaq.push([", "_setaccount", "ua-19174022-1]", "_gaq.push([", "\"_gaq.push(['_setaccount'\"",
              "_trackpageview","(function()", "{", "var", "ga", "=", "document.createelement",
              "'ua-19174022-1'", "_gaq.push(['_trackpageview'",
              "script", "ga.type", "=", "text/javascript", "ga.async", "=", "true;", "ga.src", "=", "https:",
              "==", "document.location.protocol", "?", "https://ssl", ":", "http://www", "+", 
              ".google-analytics.com/ga.js", "var", "s", "=", "document.getelementsbytagname",
              "s.parentnode.insertbefore(ga", "s", "})();", "jump", "to", "contentjump", "to", "main",
              "navigationjump", "section", "navigation",  "proceedings", 
              '(www.oldbaileyonline.org', '8.0', '2020)', '1780',
              "'ua-19174022-1'])", "_gaq.push(['_trackpageview'])", "document.createelement('script')", 
              "'text/javascript'", 'true', "('https:'", "'https://ssl'", '', "'http://www')",
              "'.google-analytics.com/ga.js'", "document.getelementsbytagname('script')[0]", 's)', '})()',
              "bailey",  "1674", "to", "1913", "main", "navigationhomesearchabout",
              "proceedingshistorical", "backgrounddatathe", "projectcontact", "benjamin", "bowsey", 
              "breaking", "peace:", "riot.", "28th", "june", "1780reference", 
              "numbert17800628-33verdictguiltysentencedeathrelated", "material", "associated", 
              "recordsactionscite", "this", "textold", "bailey", "proceedings", "online", 
              '1674-1834', 'api', 'demonstrator', '<!--', 'google_ad_client', 'pub-6166712890256554', 
              '/*', '180x150', 'created', '21/11/08', '*/', 'google_ad_slot', '3829571269', 'google_ad_width',
              '180', 'google_ad_height', '150', '//-->', '<!--', 'google_ad_client', 'pub-6166712890256554', '/*',
              '180x150', 'created', '21/11/08', '*/', 'google_ad_slot', '1983343858', 'google_ad_width', '180', 
              'google_ad_height', '150', '//-->', '<!--', 'google_ad_client', 'pub-6166712890256554', '/*', 
              '180x150', 'created', '21/11/08', '*/', 'google_ad_slot', '9176171409', 'google_ad_width', '180', 
              'google_ad_height', '150', '//-->', 'footer', 'march', '2018', '©', '2003-2018',  
              'www.oldbaileyonline.org', '2020', 't17800628-33).close', "'324'",
              "_gaq.push(['_setaccount", "_gaq.push(['_trackpageview", "document.createelement('script", "document.getelementsbytagname('script')[0",
              "_gaq.push(['_setaccount',", "ua-19174022-1']);", "_gaq.push(['_trackpageview']);", 
              "document.createelement('script');", "text/javascript';", "('https:", "http://www')", 
              ".google-analytics.com/ga.js';", "document.getelementsbytagname('script')[0];", 's.parentnode.insertbefore(ga,', 's);',
               'web', 'site', 'sitemap', 'copyright', '&', 'citation', 'guide', 'visual', 'design', 'technical',
              'design', 'xml', 'feedback', 'ua-19174022-1', "'www.oldbaileyonline.org'", "'2020'", "'t17800628-33).close'", 
              "'ua-19174022-1']", "_gaq.push(['_trackpageview']", 'function', "document.createelement('script'", "'https:'", "'http://www'", '}', 
              "(www.oldbaileyonline.org,", "version", "8.0,", "08", "september", "2020),", "june", "1780,", 
              '"pub-6166712890256554";', '180x150,', '"3829571269";', '180;', '150;', '"pub-6166712890256554";', '180x150,', 
              '"1983343858";', '180;', '150;', '"pub-6166712890256554";', '180x150,', '"9176171409";', '180;', '150;',
              "trial", "of", "benjamin", "bowsey", "(t17800628-33).close", "|", "print-friendly", "version", 
              "|", "report", "errornavigation<", "previous", "text", "(trial", "account)", "|", "next", 
              "text", "(trial", "account)", ">see", "original", "324."]
#print(extrawords)

In [36]:
# we can get rid of the extrawords using a loop
cleantext = []
for w in words:
    cleaned = w.lower().strip('",.;:?([)]') 
    if cleaned not in extrawords: 
        cleantext.append(cleaned)

print(cleantext)


['the', 'the', 'old', "london's", 'the', 'peace', 'riot', '17', 'august', '2021', 'an', 'account', 'account', '324', 'a', 'blackmoor', 'was', 'indicted', 'for', 'that', 'he', 'together', 'with', 'five', 'hundred', 'other', 'persons', 'and', 'more', 'did', 'unlawfully', 'riotously', 'and', 'tumultuously', 'assemble', 'on', 'the', '6th', 'the', 'disturbance', 'the', 'public', 'peace', 'and', 'did', 'begin', 'demolish', 'and', 'pull', 'down', 'the', 'dwelling', 'house', 'richard', 'akerman', 'against', 'the', 'form', 'the', 'statute', '&c', 'rose', 'jennings', 'esq', 'sworn', 'had', 'you', 'any', 'occasion', 'be', 'in', 'part', 'the', 'town', 'on', 'the', '6th', 'in', 'the', 'evening', 'i', 'dined', 'with', 'my', 'brother', 'who', 'lives', 'opposite', 'mr', "akerman's", 'house', 'they', 'attacked', 'mr', "akerman's", 'house', 'precisely', 'at', 'seven', "o'clock", 'they', 'were', 'preceded', 'by', 'a', 'man', 'better', 'dressed', 'than', 'the', 'rest', 'who', 'went', 'up', 'mr', "akerman'

After cleaning, the text looks much better, doesn't it?  There is still a smattering of misplaced 'the's at the beginning of the passage, left over from the webpage heading.  But by and large, you can now read the text like a transcript of the trial, word for word.

It's important to check the text as you work. At many points in a text-mining project, good practice requires "inspecting" the data set to see how well a particular operation worked.  In this case, we're checking how well the cleaning step worked.

One thing that you might notice in the cleaned data is that we have made some choices that a careful viewer might quibble with.  For instance, we have removed double quotes -- " -- but not single quotes, which allows us to retain apostrophes for possessive words such as "london's."  Slight differences of this kind will begin to matter when we start counting each word.  As the text stands now, the number of times "london's" appears will be counted separately from the number of times that "london" appears.  Depending on what you want to count, you might want to count "london's" along with other appearances of "london."    
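
To see why the apostrophe in "london's" survives, it helps to look at what .strip() does on its own. The sketch below reuses the same set of characters as the cleaning loop above and applies it to a few sample tokens (the last one, '"guilty"', is made up for illustration). Note that .strip() only removes characters from the two ends of a string, never from the middle.

```python
# The same characters that the cleaning loop strips from each word
strip_chars = '",.;:?([)]'

samples = ['court,', '(t17800628-33).close', "london's", '"guilty"']
stripped = [s.strip(strip_chars) for s in samples]

for before, after in zip(samples, stripped):
    print(before, '->', after)

# court, -> court
# (t17800628-33).close -> t17800628-33).close
# london's -> london's
# "guilty" -> guilty
```

The second sample shows why 't17800628-33).close' has to appear in extrawords: the closing parenthesis and period sit in the middle of the token, so .strip() leaves them alone.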

### Questions for small group discussion

How would you rewrite the code above so that it treated possessive words ("london's", "akerman's") the same as non-possessive nouns ("london", "akerman")?
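
One possible starting point, offered as a rough sketch and not as the only answer: a small helper that removes a trailing 's or s' from a word. The function name strip_possessive and the sample list are made up for illustration. Be aware that this heuristic cannot tell a possessive from a contraction, so "it's" would become "it"; whether that matters depends on what you are counting.

```python
def strip_possessive(word):
    """Return the word without a trailing 's or s' (a rough heuristic)."""
    if word.endswith("'s"):
        return word[:-2]   # london's -> london
    if word.endswith("s'"):
        return word[:-1]   # prisoners' -> prisoners
    return word

sample = ["london's", 'london', "akerman's", "o'clock", "prisoners'"]
normalized = [strip_possessive(w) for w in sample]
print(normalized)
# ['london', 'london', 'akerman', "o'clock", 'prisoners']
```

In the cleaning loop, the helper could be applied to each word right after the .strip() step, before checking it against extrawords.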

One advantage of humanist-coder groups is that they can have a rich conversation about the stakes of  counting hyphenated words, possessive words, etc. as separate from the word-stem or not.  Most of those choices depend upon the humanistic questions involved with a text mining project.  For example, if we were looking for all place names, we'd want to count "london's" alongside "london" and "paris's" alongside "paris".  These questions are something to consider in your small groups this week.  How will you treat possessives and similar compounds?  

Your answer to this question will form a small part of your final paper -- perhaps only a sentence and a line of code, but important sentences and lines of code nonetheless.

Consider one example. If we were thinking about the historical composition of houses we might want to treat "dining-room" and "dining room" as the same term; but we also might want to treat "dining room" as a separate  compound from "dining" and "room."  In such a case, we might deliberately add in a hyphen to make "dining" and "room"  match "dining-room" (and the same with "living room," "bed room," "dwelling house," or any other architectural compounds) so that our counts for each kind of room are as accurate as possible.  

* A question for coders: Could you code a loop to find the most common occurrences of compounds involving "room" and "house" and replace any that occur more than three times with a hyphenated compound?  

* A question for humanists: Examining the text output above, can you identify any other compounds or potential compounds that  we might want to treat with care, for example, breaking up hyphens or apostrophes, rendering plurals as singular, or identifying compound phrases?  
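
As a hint for the coders' question, here is a minimal sketch of one way to approach it. Everything here is an assumption made for illustration: the function name hyphenate_compounds, the threshold parameter, the short skip list of function words, and the tiny sample list. The idea is to count every adjacent pair of words whose second word is "room" or "house", and then join the pairs that occur more than three times with a hyphen.

```python
from collections import Counter

def hyphenate_compounds(tokens, heads=('room', 'house'), threshold=3,
                        skip=('the', 'a', 'his', 'my')):
    # Count adjacent pairs whose second word is a head word,
    # ignoring pairs that begin with a function word such as 'the'
    pairs = Counter(
        (first, second)
        for first, second in zip(tokens, tokens[1:])
        if second in heads and first not in skip
    )
    # Keep only the pairs that occur more than `threshold` times
    frequent = {pair for pair, n in pairs.items() if n > threshold}

    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in frequent:
            merged.append('-'.join(pair))  # e.g. 'dwelling-house'
            i += 2                         # skip both words of the pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged, pairs

# A made-up sample: 'dwelling house' four times, 'dining room' once
sample = ('dwelling house ' * 4 + 'dining room the house').split()
merged, pairs = hyphenate_compounds(sample)
print(pairs.most_common())
print(merged)
```

On the real trial text you would call the function on cleantext. The skip list would almost certainly need extending after you inspect the pair counts, which is itself a choice worth discussing with the humanists in your group.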

In [37]:
# Next, let's ask some basic questions about what we just did.

# How many words did we download?
len(words)

2874

In [38]:
# how many words were eliminated after we cleaned them?
len(words) - len(cleantext)

343



Now, create a Counter object by calling Counter() with the list of things you want to count:


In [39]:
count = Counter(cleantext)

In [40]:
count.most_common(25)

[('the', 195),
 ('i', 105),
 ('was', 71),
 ('in', 62),
 ('a', 52),
 ('and', 51),
 ('he', 50),
 ('you', 50),
 ('that', 40),
 ('his', 38),
 ('it', 36),
 ('did', 34),
 ('on', 27),
 ('him', 27),
 ('at', 26),
 ('house', 23),
 ('had', 23),
 ('my', 21),
 ('there', 21),
 ('any', 20),
 ('yes', 20),
 ('mr', 19),
 ('not', 19),
 ('were', 18),
 ('prisoner', 18)]



As mentioned before, you can use the object as a dictionary to get the count for a particular value:


In [41]:
count['prisoner']

18



You can iterate over the list returned by .most_common() using a for loop. Each item is a (word, number) pair; here we print out just the words:


In [42]:
for word, number in count.most_common(10):
    print(word)

the
i
was
in
a
and
he
you
that
his
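
The loop above unpacks each pair into word and number but only prints the word. A small variation, sketched here on a made-up word list so that it runs on its own, prints both values side by side:

```python
from collections import Counter

# A tiny made-up word list, standing in for cleantext
count = Counter('the the the was was it'.split())

lines = [f'{word}: {number}' for word, number in count.most_common(2)]
for line in lines:
    print(line)

# the: 3
# was: 2
```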
