# Simple word counting



To reason about unstructured data, we need to somehow convert words, or pixels, into numbers. What this does to the *meaning* of words is an interesting question that we'll slowly unfold over the course of the semester.

But first, just practically, how would you do it? 

Let's practice several different methods.

In [2]:
import re
from collections import Counter
from pathlib import Path

## Basic file i/o

Often we'll need to read in a file. If the file is structured as a table, we'll use a special module called "Pandas" to do that. But in the case of simple text files we need to understand a little about the way the file is represented on disk. For instance, it matters whether we want to treat the whole file as a single string, or treat it as a sequence of strings separated by line breaks.

In [4]:
prosepath = Path('../../texts/Milne_ALostMasterpiece.txt') 

# We could probably just use a string there instead of a Path
# object. But I'm being exaggeratedly cautious to ensure
# I create habits that will work on Windows machines, where
# the slashes go the other way. A Path object automatically
# adjusts for your operating system.

# It doesn't really matter whether you use single quotes or
# double quotes to enclose strings. I use single because
# I'm lazy about hitting the shift key.

paragraphs = open(prosepath, encoding = 'utf-8').readlines()

# The `open` function creates a file object, which has "methods"
# you can call. Those are like functions attached to the object.
# object-period-method() is the typical syntax for calling a
# method in Python. The method may or may not take arguments
# inside the parens.

# `readlines` is a method that returns a list of strings,
# breaking whenever it hits a line break. In this file
# line breaks exist only between paragraphs.
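As a small aside, here is a sketch of the same two reading styles using a `with` block, which closes the file for you when the block ends. The file name `demo.txt` and its contents are made up for illustration; they are not part of the course texts.

```python
import tempfile
from pathlib import Path

# A tiny stand-in file (hypothetical contents), written to a temporary folder.
with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = Path(tmpdir) / 'demo.txt'
    demo_path.write_text('First paragraph.\n\nSecond paragraph.\n', encoding='utf-8')

    # `with` closes the file automatically when the indented block ends.
    with open(demo_path, encoding='utf-8') as f:
        lines = f.readlines()       # a list of strings, one per line

    with open(demo_path, encoding='utf-8') as f:
        whole_text = f.read()       # the whole file as a single string

print(lines)                    # each string keeps its trailing '\n'
print(repr(whole_text))         # one string containing every line break
print(whole_text.splitlines())  # a list again, without the '\n' characters
```

Note that a blank line in the file becomes its own (nearly empty) string in the list.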

In [5]:
len(paragraphs)

23

In [6]:
paragraphs[0]      # The first paragraph. Python starts counting at zero.

'The short essay on “The Improbability of the Infinite” which I was planning for you yesterday will now never be written. Last night my brain was crammed with lofty thoughts on the subject--and for that matter, on every other subject. My mind was never so fertile. Ten thousand words on any theme from Tin-tacks to Tomatoes would have been easy to me. That was last night. This morning I have only one word in my brain, and I cannot get rid of it. The word is “Teralbay.”\n'

Notice the newline character at the end of that string, represented with the two-character sequence `\n`. Otherwise it would be invisible: it would just look like a line break. Also notice that this text uses fancy curly quotes “”; those are different characters from ordinary straight quotes "".
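A quick way to convince yourself of both points is to poke at a short string. The variable `line` below is just a stand-in for the end of the paragraph above.

```python
line = '“Teralbay.”\n'   # a short stand-in for the end of the paragraph above

print(line)          # the newline shows up only as an extra blank line
print(repr(line))    # repr() makes it visible as \n
print(len('\n'))     # 1: two characters in the source code, one in the string

stripped = line.strip()   # removes whitespace, including '\n', from both ends
print(repr(stripped))

# Curly and straight quotes are different characters, so they never match.
print('“' == '"')                     # False
print(ord('“'), ord('”'), ord('"'))   # 8220 8221 34
```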

**Alternate .read() method**

It's also possible to read a whole file as a single string, using the ```.read()``` method.

In [8]:
poempath = Path('../../texts/Libai_ThoughtsInTheSilentNight.txt')
fullpoem = open(poempath, encoding = 'utf-8').read()
fullpoem

'床前明月光，\n疑是地上霜。\n举头望明月，\n低头思故乡。\n'

You can see that all the lines of this four-line poem are contained in a single variable.

UTF-8 is the most common character encoding, but you will sometimes encounter files in a different one, that is, a different way of translating bytes into characters. If you get a `UnicodeDecodeError` saying that some byte can't be decoded as UTF-8 (for instance, "invalid start byte"), or if special characters like é behave strangely, a mismatched encoding is a likely explanation. Saving a file in Excel may change the encoding without informing you; at least that *used* to happen regularly.
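To see what an encoding mismatch looks like, here is a minimal sketch that encodes the single character é two ways and then decodes each with the wrong codec. Latin-1 is used only as an example of "some other encoding"; your mystery file may well be in a different one.

```python
word = 'é'

utf8_bytes = word.encode('utf-8')      # b'\xc3\xa9' (two bytes)
latin1_bytes = word.encode('latin-1')  # b'\xe9' (one byte)

# Reading Latin-1 bytes as if they were UTF-8 fails loudly...
try:
    latin1_bytes.decode('utf-8')
except UnicodeDecodeError as err:
    print('Could not decode:', err)

# ...while reading UTF-8 bytes as if they were Latin-1 garbles them silently.
garbled = utf8_bytes.decode('latin-1')
print(garbled)   # 'Ã©'
```

If you do know a file's encoding, you can pass it to `open` in place of `'utf-8'`.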

## Counting words; using dictionaries

Let's count the words in "A Lost Masterpiece." There's a very simple way to do this, but let's take the long way around first to demonstrate a few things.

A "dictionary" is a data structure that maps one set of values onto another. Each of the "values" in the dictionary is looked up by a unique "key." Keys can be of different types, and a number and a string that look alike (16 and '16') are different keys.

In [9]:
randomthoughts = dict()
randomthoughts['banana'] = 'only good before ripe'
randomthoughts[16] = 'four squared'
randomthoughts['16'] = randomthoughts['banana']

In [10]:
randomthoughts[16]

'four squared'

In [11]:
randomthoughts['16']

'only good before ripe'
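A few more dictionary behaviors are worth trying before we start counting. This sketch uses a made-up dictionary, `fruit_notes`, rather than the one above.

```python
fruit_notes = {'banana': 'only good before ripe', 16: 'four squared'}

print(16 in fruit_notes)      # True
print('16' in fruit_notes)    # False: the string '16' is a different key

# Asking for a missing key with square brackets raises a KeyError;
# .get() returns a default value instead.
print(fruit_notes.get('kiwi', 'no opinion'))

# Assigning to an existing key replaces its value rather than adding a key.
fruit_notes['banana'] = 'fine in bread'
print(len(fruit_notes))       # still 2
```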

So one way to count the words in "A Lost Masterpiece" would be to split each line into words, then create a dictionary that keeps track of the number of times each word appears.

In [12]:
wordsinfirstpar = paragraphs[0].split() # this splits a string using any white space character
wordsinfirstpar[0: 20]

['The',
 'short',
 'essay',
 'on',
 '“The',
 'Improbability',
 'of',
 'the',
 'Infinite”',
 'which',
 'I',
 'was',
 'planning',
 'for',
 'you',
 'yesterday',
 'will',
 'now',
 'never',
 'be']

In [13]:
naivewordcounts = dict()

for w in wordsinfirstpar:            # this is a for-loop
    if w not in naivewordcounts:     # this is an if-then statement
        naivewordcounts[w] = 0       # notice how indentation works:
                                     # the indented lines are executed if the condition is true 
    naivewordcounts[w] = naivewordcounts[w] + 1  # or you could just say naivewordcounts[w] += 1

In [14]:
naivewordcounts['the']

2

### why is that wrong?

Huh. I can see that the word 'the' appears more than twice in that paragraph.

What's the problem? Why is my count off?

### let's solve the problem

I'm going to define a function that will do a better job of "splitting a text into words."

This function, by the way, is borrowed from Melanie Walsh.

In [15]:
def split_into_words(any_chunk_of_text):
    lowercase_text = any_chunk_of_text.lower()
    # r"..." is a raw string, so the backslash in \W reaches the regex unchanged
    split_words = re.split(r"\W+", lowercase_text)
    return split_words

In [16]:
wordsinfirstpar = split_into_words(paragraphs[0])
wordsinfirstpar[0 : 20]

['the',
 'short',
 'essay',
 'on',
 'the',
 'improbability',
 'of',
 'the',
 'infinite',
 'which',
 'i',
 'was',
 'planning',
 'for',
 'you',
 'yesterday',
 'will',
 'now',
 'never',
 'be']

In [17]:
# Now we can count occurrences of 'the' in the first paragraph
# using this very simple list method:

wordsinfirstpar.count('the')

5

### why does that work?

Walsh's function relies on *regular expressions.* If the meaning of '\W+' is not clear to you, that's normal. No one ever remembers how a particular regular expression works. To understand what's happening in a regex, I always have to check out [Regex101](https://regex101.com) and play around. Let's do that.

Then come back and (a) explain how Walsh's function works, and
(b) think of cases where it will fail to break "words" in the places we might ordinarily expect.
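If you'd like a starting point in Python rather than on Regex101, this small sketch shows which pieces of a made-up sentence `\W+` treats as separators. It only illustrates the mechanics; the questions above are still yours to answer.

```python
import re

sample = 'wait, what? yes.'   # a made-up sentence, already lowercase

# \W matches any single character that is not a letter, digit, or underscore;
# + means "one or more in a row".
separators = re.findall(r'\W+', sample)
print(separators)    # [', ', '? ', '.']

# re.split cuts the string wherever the pattern matches.
pieces = re.split(r'\W+', sample)
print(pieces)        # ['wait', 'what', 'yes', '']
```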

## Write code that counts all the words in Milne's story

Remember that we currently have the story as a list of separate "paragraphs." The variable name is ```paragraphs```.

And for right now humor me and write this using a dictionary. There's an easier way to do it, which I'm about to admit.

But not everyone in the class is familiar with Python, so the syntax of loops and if-thens (and indentation) is worth exploring.

In [18]:
# your code goes here


## Probably the easier way

A Counter is a kind of dictionary that treats any key it hasn't seen yet as having a count of zero, so you never have to initialize a key before adding to it.

Also you can initialize it directly from a list.

So an easier way to do the wordcounting is:

In [19]:
fulltext = open(prosepath, encoding = 'utf-8').read()
allwords = split_into_words(fulltext)
wordcounts = Counter(allwords)
wordcounts['the']

40
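Here is a tiny sketch of those two Counter behaviors, plus the handy `most_common` method, on a made-up word list.

```python
from collections import Counter

tally = Counter(['the', 'word', 'is', 'the', 'word'])

print(tally['the'])           # 2
print(tally['teralbay'])      # 0: a missing key counts as zero, with no KeyError
print(tally.most_common(2))   # [('the', 2), ('word', 2)]

# You can keep adding to an existing Counter.
tally.update(['the'])
print(tally['the'])           # 3
```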

## An even easier way

In [37]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [30]:
# The vectorizer produces something called a 'sparse matrix';
# we need to unpack it.
vectorizer = CountVectorizer(max_features = 20)
sparse_counts = vectorizer.fit_transform(paragraphs)

In [31]:
sparse_counts

<23x20 sparse matrix of type '<class 'numpy.int64'>'
	with 142 stored elements in Compressed Sparse Row format>

In [32]:
sparse_counts.toarray()

array([[2, 1, 2, 2, 1, 1, 1, 0, 3, 0, 2, 0, 0, 1, 2, 5, 1, 2, 2, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [4, 2, 1, 1, 1, 4, 5, 0, 1, 2, 5, 1, 0, 2, 2, 5, 6, 6, 5, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [3, 4, 1, 0, 1, 4, 9, 2, 0, 1, 3, 3, 0, 1, 1, 2, 4, 2, 0, 6],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [2, 2, 2, 2, 1, 1, 5, 3, 1, 1, 3, 0, 0, 1, 6, 3, 1, 2, 1, 7],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [3, 0, 0, 2, 2, 3, 2, 0, 0, 4, 2, 2, 0, 0, 3, 2, 2, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 2, 0, 0, 0, 1, 1, 0, 2, 0, 0, 0, 4, 5, 1, 3, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 1, 2, 2, 4, 0, 0,

In [35]:
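# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() returns a NumPy array, so we convert it to a list
vectorizer.get_feature_names_out()[0:10].tolist()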

['and', 'be', 'for', 'have', 'in', 'is', 'it', 'may', 'my', 'not']
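To demystify what the vectorizer is doing, here is a rough plain-Python sketch of the same idea. It is not scikit-learn's actual implementation. It assumes the default settings (lowercasing, and tokens of two or more word characters), keeps the most frequent terms across the whole collection, and puts the columns in alphabetical order. The mini corpus `docs` and the helper `tokenize` are made up for illustration, and ties between equally frequent terms may be broken differently by the real thing.

```python
import re
from collections import Counter

docs = ['The word is the word.', '', 'I have a word for you.']  # made-up mini corpus

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters,
    # so one-letter words like 'i' and 'a' are dropped.
    return re.findall(r'\b\w\w+\b', text.lower())

# One Counter per document.
doc_counts = [Counter(tokenize(d)) for d in docs]

# Total counts across all documents, used to pick the most frequent terms.
totals = Counter()
for counts in doc_counts:
    totals.update(counts)

top_terms = [term for term, _ in totals.most_common(2)]  # like max_features = 2
vocabulary = sorted(top_terms)                           # alphabetical columns

# One row per document, one column per vocabulary term.
matrix = [[counts[term] for term in vocabulary] for counts in doc_counts]

print(vocabulary)   # ['the', 'word']
print(matrix)       # [[2, 2], [0, 0], [0, 1]]
```

Notice that the empty string in the middle produces a row of all zeros.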

In [40]:
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
doc_term = pd.DataFrame(sparse_counts.toarray(), index = list(range(len(paragraphs))), 
                            columns = vectorizer.get_feature_names_out())

In [41]:
doc_term

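| | and | be | for | have | in | is | it | may | my | not | of | or | shall | teralbay | that | the | this | to | word | you |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 1 | 2 | 2 | 1 | 1 | 1 | 0 | 3 | 0 | 2 | 0 | 0 | 1 | 2 | 5 | 1 | 2 | 2 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 4 | 2 | 1 | 1 | 1 | 4 | 5 | 0 | 1 | 2 | 5 | 1 | 0 | 2 | 2 | 5 | 6 | 6 | 5 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 3 | 4 | 1 | 0 | 1 | 4 | 9 | 2 | 0 | 1 | 3 | 3 | 0 | 1 | 1 | 2 | 4 | 2 | 0 | 6 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 2 | 2 | 2 | 2 | 1 | 1 | 5 | 3 | 1 | 1 | 3 | 0 | 0 | 1 | 6 | 3 | 1 | 2 | 1 | 7 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 3 | 0 | 0 | 2 | 2 | 3 | 2 | 0 | 0 | 4 | 2 | 2 | 0 | 0 | 3 | 2 | 2 | 1 | 0 | 0 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |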


In [42]:
doc_term['teralbay']

0     1
1     0
2     2
3     0
4     1
5     0
6     1
7     0
8     0
9     0
10    0
11    0
12    5
13    0
14    0
15    0
16    0
17    0
18    1
19    0
20    1
21    0
22    0
Name: teralbay, dtype: int64

In [46]:
# Notice, not intuitive!! axis = 'columns' means "add across the columns,"
# so the result is one sum per *row*, not one sum per column.
rowsums = doc_term.sum(axis = 'columns')
len(rowsums)

23
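If the axis argument still feels backwards, it may help to see the two directions of summing with plain Python lists. This sketch uses the first three rows and first four columns of the table above. The list comprehensions are only an analogy for what pandas does, not how pandas computes it.

```python
# First three rows and first four columns of the table above.
table = [
    [2, 1, 2, 2],
    [0, 0, 0, 0],
    [4, 2, 1, 1],
]

# One total per row: add up the values *across the columns* of each row.
# This is the analogue of doc_term.sum(axis = 'columns').
row_sums = [sum(row) for row in table]

# One total per column: add up the values *down the rows* of each column.
# This is the analogue of doc_term.sum(axis = 'index').
column_sums = [sum(col) for col in zip(*table)]

print(row_sums)      # [7, 0, 8]
print(column_sums)   # [6, 3, 3, 3]
```

In other words, the axis you name is the one that gets collapsed.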