# Hist 3368 - Week 2 - Counting words

Next, let's use the *song* list (defined in the first cell below) as the basis for a new pandas 'Series'.  Remember that a Series is just another kind of datatype, in this case one that's very good at counting.  We'll use the pandas command .Series().  Because it's a pandas command, we have to name pandas first and then call .Series(), as in pd.Series() once pandas has been imported under the nickname pd.  

Also, note the capital **S** in Series.  Python is case-sensitive, so capitalization matters: pd.Series() works, but pd.series() produces an error.

In [2]:
song = ['row', 'row', 'row', 'your', 'boat', 'gently', 'down', 'the', 'stream']

We saw in a previous notebook that you can do individual counts of words with the .count() function, which works with lists.

In [23]:
song.count('row')

3

In [24]:
song.count('boat')

1

The function .count() is useful.  But it only counts one element of the list at a time. To count every word, you can use .count() inside a for loop, like so:

In [62]:
songcount = [] # this makes an empty list, 'songcount', which will be filled out in each round of the loop 

for word in song: # a simple for loop, moving through the words of 'song', one word at a time. notice the use of 'for', 'in', and ':' 
    # (remember: 'word' is a dummy variable; we could also call it 'brontosaurus' etc)
    
    wordcount = song.count(word) # this command counts one word in the list for each round of the loop
    songcount.append(wordcount) # this command adds the wordcount from the previous line to the list 'songcount'
    
songcount

[3, 3, 3, 1, 1, 1, 1, 1, 1]

Now we have a list of the counts of each word.  

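This method works fine if you have a very short list -- like the variable *song*.  But if you had a document thousands of words long, using .count() this way wouldn't work very well: the list of counts doesn't say which word each count belongs to, and repeated words are counted again every time they appear.  You'd need to do more work to get the data into a form where you know which word has which count.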
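
To see what getting the data "into a form where you know which word has which count" can involve, here is a minimal sketch that pairs each distinct word with its count in a plain dictionary. The variable name `word_to_count` is hypothetical, chosen only for this illustration.

```python
song = ['row', 'row', 'row', 'your', 'boat', 'gently', 'down', 'the', 'stream']

word_to_count = {}  # hypothetical name: maps each distinct word to its count

for word in song:
    if word not in word_to_count:  # skip words we have already counted
        word_to_count[word] = song.count(word)

print(word_to_count)
```

This works, but it still calls .count() once per distinct word, which is slow on long texts. The tools introduced next do the same job in one step.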

Fortunately, there are many other functions for counting which are designed to produce usable results quickly.  

In this class, we will often use two other commands to count -- .value_counts() (which works with Series objects) or Counter (which produces a dictionary).

### Counting words with pandas' Series data type

Let's explore two ways of counting.  Each one uses another data type that is not a list.

First, let's look at how you count with a Pandas Series.


Make sure pandas is installed.  We use the "import" command to load a software package that is already installed so that we can use it in this notebook.  When we use "import" we can also use "as" to give the package a familiar nickname, in this case, 'pd'.

In [28]:
import pandas as pd

In [29]:
seriessong = pd.Series(song)

In [30]:
print(seriessong)

0       row
1       row
2       row
3      your
4      boat
5    gently
6      down
7       the
8    stream
dtype: object


Note that the Series datatype has a particular look. It has row numbers (the 'index') going down the left.  At the bottom, "dtype: object" tells you what kind of values the Series holds; "object" is how pandas typically labels text.  Essentially, a pandas Series is a tiny spreadsheet. Our Series, seriessong, is one column wide. 

If we're ever confused, we can ask Python to tell us what type of data we're looking at.

In [31]:
type(seriessong)

pandas.core.series.Series

Pandas is really good at counting Series quickly, using the command *.value_counts()*.  

In [32]:
seriessong.value_counts()

row       3
the       1
stream    1
your      1
down      1
gently    1
boat      1
dtype: int64
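
As a small sketch of what .value_counts() returns, the example below pulls out the most frequent word and its count, and also asks for proportions instead of raw counts using the `normalize=True` option. The variable names (`counts`, `top_word`, `top_count`, `shares`) are hypothetical.

```python
import pandas as pd

song = ['row', 'row', 'row', 'your', 'boat', 'gently', 'down', 'the', 'stream']

counts = pd.Series(song).value_counts()  # sorted from most to least frequent

top_word = counts.index[0]   # the most frequent word
top_count = counts.iloc[0]   # how many times it appears

# normalize=True returns each word's share of the total instead of a raw count
shares = pd.Series(song).value_counts(normalize=True)

print(top_word, top_count)
print(shares.head(3))
```

The most frequent word always comes first, but the order among words tied at 1 can differ from the output shown above depending on the version of pandas.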

Because we'll want to call value_counts again, this time let's save the results as a new variable, *songcounts.* 

What type of data is songcounts? Let's ask.

In [33]:
songcounts = seriessong.value_counts()

In [34]:
type(songcounts)

pandas.core.series.Series


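We can navigate the Series much as we do a list -- with square brackets.  

We can call the first member of songcounts by position using .iloc, square brackets, and the number 0, which you will remember is how Python numbers the first item in a list. (Older versions of pandas also accept songcounts[0] here, but newer versions warn against it because the index of songcounts holds words, not numbers.)

In [35]:
songcounts.iloc[0]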

3
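
A Series can be looked up either by position or by label, and pandas offers an explicit tool for each. The sketch below, which rebuilds songcounts from scratch so that it runs on its own, uses .iloc for position and .loc for label. The variable names `first_by_position` and `row_by_label` are hypothetical.

```python
import pandas as pd

song = ['row', 'row', 'row', 'your', 'boat', 'gently', 'down', 'the', 'stream']
songcounts = pd.Series(song).value_counts()

first_by_position = songcounts.iloc[0]  # .iloc looks up by position (0 = first row)
row_by_label = songcounts.loc['row']    # .loc looks up by index label (the word)

print(first_by_position, row_by_label)
```

Both lookups return 3 here, because 'row' is the most frequent word and therefore sits in the first position.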

We can use the function .get() to look up the count for a word in the pandas Series if we know the word in question.

In [36]:
songcounts.get('row')

3

In [37]:
songcounts.get('boat')

1

In [38]:
songcounts.get('your')

1
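
One detail worth knowing about .get(): when the word is not in the Series, it returns None rather than raising an error, and it accepts a second argument to use as a fallback. A minimal sketch, with hypothetical variable names `missing` and `missing_as_zero`:

```python
import pandas as pd

song = ['row', 'row', 'row', 'your', 'boat', 'gently', 'down', 'the', 'stream']
songcounts = pd.Series(song).value_counts()

missing = songcounts.get('banana')             # word not present: returns None
missing_as_zero = songcounts.get('banana', 0)  # second argument is the fallback value

print(missing, missing_as_zero)
```

Using 0 as the fallback is convenient when counting, since a word that never appears has a count of zero.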

We can use square brackets to ask Python for more information.

In this case, we put inside the square brackets the conditions we want to meet:  "show us the parts of songcounts where the value of songcounts is 3."

In [39]:
songcounts[songcounts == 3]

row    3
dtype: int64
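
The condition inside the square brackets is itself a Series of True and False values, often called a mask. The sketch below builds the mask separately so you can inspect it, then uses it to filter. The variable names `mask`, `once`, and `more_than_once` are hypothetical.

```python
import pandas as pd

song = ['row', 'row', 'row', 'your', 'boat', 'gently', 'down', 'the', 'stream']
songcounts = pd.Series(song).value_counts()

mask = songcounts == 1   # True for every word that appears exactly once
once = songcounts[mask]  # keep only the rows where the mask is True

more_than_once = songcounts[songcounts > 1]  # other comparisons work the same way

print(mask)
print(once)
print(more_than_once)
```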

Note that when we called .value_counts(), the words themselves were stored in the part of the pandas Series known as the 'index'.

We can always get to that 'index' by using the attribute .index (note: no parentheses).

.index produces a list-like object holding the labels of the pandas Series.

In [40]:
songcounts.index

Index(['row', 'the', 'stream', 'your', 'down', 'gently', 'boat'], dtype='object')

We can navigate this as we would a list.  .index[0] produces the first item in the list.

In [41]:
songcounts.index[0]

'row'
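
Because the index behaves much like a list, it can be converted into an ordinary Python list or searched with the 'in' operator. A minimal sketch, with the hypothetical variable name `words_in_order`:

```python
import pandas as pd

song = ['row', 'row', 'row', 'your', 'boat', 'gently', 'down', 'the', 'stream']
songcounts = pd.Series(song).value_counts()

words_in_order = list(songcounts.index)  # turn the Index into a plain list

has_boat = 'boat' in songcounts.index      # check whether a word was counted
has_banana = 'banana' in songcounts.index

print(words_in_order)
print(has_boat, has_banana)
```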

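Try navigating the pandas Series *songcounts* some more. For lookups by position, .iloc is the safest choice, because newer versions of pandas warn when a bare number in square brackets is used as a position on a Series whose index holds words.

In [42]:
songcounts.iloc[3]

1

In [43]:
songcounts.iloc[-1]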

1

In [44]:
songcounts.index[-1]

'boat'

In [45]:
songcounts[-3:]

down      1
gently    1
boat      1
dtype: int64

### Counting words with the 'Dictionary' data type

We can also count the words in a list using a dictionary, using the 'collections' software package, which has a function called 'Counter'. 

The function 'Counter' produces word counts in the format of a dictionary.

In general, we'll use collections and Counter less frequently than we'll use pandas Series and the function .value_counts().  It's still useful to know that there are many ways to count in Python.  

Test your memory of how to navigate a dictionary object by investigating the use of Counter applied to the variable *song*.

In [46]:
from collections import Counter

In [47]:
Counter(song)

Counter({'row': 3,
         'your': 1,
         'boat': 1,
         'gently': 1,
         'down': 1,
         'the': 1,
         'stream': 1})

Can we navigate to the count for one word?

In [48]:
songcounts = Counter(song)

In [49]:
songcounts[3]

0

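That doesn't appear to work the way we might have thought if we were working from a list. Why not?  The answer is that we are dealing with data structured as a *dictionary*.  The number 3 was treated as a key to look up, not as a position, and since no key 3 exists, Counter returned a count of 0.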

Dictionaries expect you to call the 'key' of the dictionary in square brackets.  Then they will return the value.  

In this case, the keys are words and the values are counts.

In [50]:
songcounts['row']

3

This variation does the same thing:

In [51]:
songcounts.get('row')

3

With an ordinary dictionary, you'll get a 'KeyError' if you use square brackets to look up a key that doesn't exist.  A Counter returns a value of zero instead.

In [52]:
songcounts['banana']

0
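
The difference between an ordinary dictionary and a Counter when a key is missing can be sketched as follows. The small dictionary `plain` and the other variable names are hypothetical, made up for this comparison.

```python
from collections import Counter

song = ['row', 'row', 'row', 'your', 'boat', 'gently', 'down', 'the', 'stream']

plain = {'row': 3, 'boat': 1}  # hypothetical ordinary dictionary
counter = Counter(song)

# An ordinary dictionary raises KeyError for a missing key
try:
    plain['banana']
    outcome = 'found'
except KeyError:
    outcome = 'KeyError'

counter_result = counter['banana']  # a Counter returns 0 instead
plain_get = plain.get('banana', 0)  # .get() with a fallback avoids the error

print(outcome, counter_result, plain_get)
```

Looking up a missing word in a Counter does not add that word to it, so 'banana' in counter is still False afterward.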

You can use the 'in' operator to check if a particular item is found in a dictionary.

In [53]:
'banana' in songcounts

False

In [54]:
'row' in songcounts

True

What about looking up a key based on a value?  Well, multiple keys may have the same value.  So there's no easy command for it. This is important: *not every data type is as easy to navigate as every other data type.*
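
Although there is no single command for it, a reverse lookup can be written by hand. The sketch below is one possible approach, not the only one: it walks through every key and value with .items() and keeps the words whose count matches. The variable names are hypothetical.

```python
from collections import Counter

song = ['row', 'row', 'row', 'your', 'boat', 'gently', 'down', 'the', 'stream']
songcounts = Counter(song)

# Collect every word whose count equals the value we are looking for
words_appearing_once = [word for word, n in songcounts.items() if n == 1]
words_appearing_three_times = [word for word, n in songcounts.items() if n == 3]

print(words_appearing_once)
print(words_appearing_three_times)
```

The result is a list rather than a single word, precisely because several keys can share the same value.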



You can navigate dictionaries in other ways.

We could call all the keys with .keys()

In [55]:
songkeys = songcounts.keys()
songkeys

dict_keys(['row', 'your', 'boat', 'gently', 'down', 'the', 'stream'])

Or all the values with .values()

In [56]:
songvalues = songcounts.values()
songvalues

dict_values([3, 1, 1, 1, 1, 1, 1])
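
Two related tools are worth a quick look: .items(), which returns keys and values together as pairs, and .most_common(), a function specific to Counter that returns the pairs sorted from the highest count to the lowest. A minimal sketch, with hypothetical variable names `pairs` and `top_two`:

```python
from collections import Counter

song = ['row', 'row', 'row', 'your', 'boat', 'gently', 'down', 'the', 'stream']
songcounts = Counter(song)

pairs = list(songcounts.items())     # (word, count) pairs, in the order first seen
top_two = songcounts.most_common(2)  # the two highest counts, as (word, count) pairs

print(pairs)
print(top_two)
```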

# Let's count with some real text from the internet.

Very often in this class we will give you the code to pull text from a historical source.  We will not teach much about how the text is formatted or what the commands are that pull the text in.  Instead, we're assuming that you'll just run the lines below to get the text required.  

Then, we'll apply the value_counts() function to the text and see the results of counting words from a real, historical text.

**Instructions**: Just run the following cell to get a historical text from a repository of eighteenth-century court cases.  We'll print the first 500 words so that you can see what you have.

In [57]:
# Import some text from online.  This isn't a step that you'll need to repeat -- since you'll be
# getting pre-packaged data in a Jupyter notebook in a few weeks -- but you can run these commands
# to "scrape" text off of the internet.
import urllib.request, urllib.error, urllib.parse, bs4 as bs
source = urllib.request.urlopen('http://www.oldbaileyonline.org/browse.jsp?id=t17800628-33&div=t17800628-33')
soup = bs.BeautifulSoup(source, 'lxml')
text = soup.get_text().lower() 
words = text.split() # split it up into words
print(words[:500])

['browse', '-', 'central', 'criminal', 'court', 'jump', 'to', 'contentjump', 'to', 'main', 'navigationjump', 'to', 'section', 'navigation', 'the', 'proceedings', 'of', 'the', 'old', 'bailey', "london's", 'central', 'criminal', 'court,', '1674', 'to', '1913', 'main', 'navigationhomesearchabout', 'the', 'proceedingshistorical', 'backgrounddatathe', 'projectcontact', 'benjamin', 'bowsey.', 'breaking', 'peace:', 'riot.', '28th', 'june', '1780reference', 'numbert17800628-33verdictguiltysentencedeathrelated', 'material', 'associated', 'recordsactionscite', 'this', 'textold', 'bailey', 'proceedings', 'online', '(www.oldbaileyonline.org,', 'version', '8.0,', '31', 'august', '2021),', 'june', '1780,', 'trial', 'of', 'benjamin', 'bowsey', '(t17800628-33).close', '|', 'print-friendly', 'version', '|', 'report', 'an', 'errornavigation<', 'previous', 'text', '(trial', 'account)', '|', 'next', 'text', '(trial', 'account)', '>see', 'original', '324.', 'benjamin', 'bowsey', '(a', 'blackmoor', ')', 'wa

Next we turn the list of words into a pandas Series and use its value_counts() function to count the words.

In [58]:
import pandas as pd
count = pd.Series(words).value_counts()

We can use square brackets to look at just the most prevalent words

In [59]:
count[:20]

the     195
i       105
to       78
-        77
was      68
of       64
in       59
and      51
he       50
a        50
you      49
that     37
his      37
it       30
did      29
at       26
on       26
had      21
my       21
mr.      19
dtype: int64



As mentioned before, you can look up the value of certain words using square brackets.

In [60]:
count['prisoner']

9

You can also look at the least frequent words

In [61]:
count[-20:]

town,       1
legs,       1
look        1
you?        1
way;        1
fire,       1
cry         1
sailor's    1
door;       1
lady        1
makes       1
did,        1
windows,    1
7th         1
bell        1
himself     1
doing.      1
four        1
(b          1
statute,    1
dtype: int64
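
Notice that many of the words above still carry punctuation ('town,', 'did,', 'you?'), so 'did' and 'did,' are counted as different words. One possible way to handle this, sketched here under the assumption that stripping punctuation from the ends of each word is good enough, uses Python's built-in string tools. The sentence in `sample` is invented for illustration and is not taken from the trial text.

```python
import string
from collections import Counter

# Invented sample sentence, used only to illustrate the idea
sample = "the prisoner did, as he said, break the windows, and the door; did you?"

words = sample.lower().split()

# Remove punctuation from the start and end of each word
cleaned = [w.strip(string.punctuation) for w in words]
cleaned = [w for w in cleaned if w]  # drop anything that was only punctuation

raw_counts = Counter(words)
clean_counts = Counter(cleaned)

print(raw_counts['did'], raw_counts['did,'])
print(clean_counts['did'])
```

Before cleaning, 'did' and 'did,' each have a count of 1. After cleaning, they are merged into a single entry with a count of 2.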

### Assignment

Create a variable named after your last name. In it, write out the lyrics to a new poem or song of your choice, at least five lines long, as a list.

* Use the Series method with the function .value_counts() to count the words in the song. 
     * Write out the code to navigate this Series:
        * What is the first item in the Series?
        * What is the last item in the Series?
        
* Use the Dictionary method and the function Counter() to count the words in the song.
     * Write out the code to navigate this dictionary:
        * How many times does the word 'the' appear in the song?
        
Take a screenshot just of your code and results. Upload it to Canvas.


