# Counting words and phrases

In [None]:
import requests
from textblob import TextBlob
import re
from collections import Counter
from nltk.corpus import stopwords
from wordcloud import WordCloud
import matplotlib.pyplot as plt

I've harvested [a gigabyte of OCRd text](https://glam-workbench.github.io/trove-books/#ocrd-text-from-trove-books-and-ephemera) from Trove's digitised books and shared it through Cloudstor. Here we'll explore [*Australian Plain Cookery by a Practical Cook*](https://nla.gov.au/nla.obj-579917051) from 1882. However, you could change the `text_file` value below to point to any of the other books on Cloudstor. There's a complete list [in this CSV file](https://github.com/GLAM-Workbench/trove-books/blob/master/trove_digitised_books_with_ocr.csv).

In [None]:
CLOUDSTOR_URL = 'https://cloudstor.aarnet.edu.au/plus/s/ugiw3gdijSKaoTL'
text_file = 'australian-plain-cookery-by-a-practical-cook-nla.obj-579917051.txt'
stop_words = stopwords.words('english')

In [None]:
# Get the text file from Cloudstor
response = requests.get(f'{CLOUDSTOR_URL}/download?files={text_file}')
text = response.text

## Counting words

One way of getting a sense of what a piece of text is about is to look at how often each word appears. You don't need any special software to do basic word counts. You can just split the text into individual words (called tokens) using a regular expression. In the cell below, `\w+` looks for groups of letters, digits, and underscores, separating words from punctuation and spaces. Then you can use `Counter` to find the frequency of each word and `.most_common()` to rank them.

In [None]:
words = re.findall(r'\w+', text.lower())
Counter(words).most_common(10)
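
To see what the regular expression is doing, here's a minimal sketch run on a made-up sentence. The `sample` text is an assumption for illustration and isn't taken from the book. Notice that apostrophes and hyphens aren't matched by `\w`, so they split words apart.

```python
import re
from collections import Counter

# A made-up sentence for illustration (not from the cookbook)
sample = "Take a pound of flour, a pound of sugar, and a pint of milk."

# Lowercase first so 'Take' and 'take' are counted together
tokens = re.findall(r'\w+', sample.lower())
counts = Counter(tokens)
top_three = counts.most_common(3)
print(top_three)

# Apostrophes and hyphens are not matched by \w, so they split words
split_example = re.findall(r'\w+', "don't over-boil")
print(split_example)
```

Words with equal counts are listed in the order they were first seen.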

<div class="alert alert-info"><img src="../images/hhicon.png" width="50px" style="vertical-align: bottom; margin-right: 10px;">Try changing the number in the brackets of <code>.most_common()</code>. When you've finished, remember to run the cell again using <b>Shift+Enter</b>!</div>

No surprise that the most common words are things like 'the' and 'of'. To focus on words that are distinctive of this particular text, we can filter our list to remove the most common words in English. We call these 'stopwords'. In the cells above we imported a list of stopwords from the NLTK package. Here we'll drop any words that appear in that list and recount.

In [None]:
filtered_words = [w for w in words if w not in stop_words]
Counter(filtered_words).most_common(10)
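
Here's the same filtering step as a small sketch you can check by hand. The `demo_stop_words` set is a hypothetical stand-in that's much shorter than NLTK's English list, and `demo_words` is made up for illustration.

```python
from collections import Counter

# Hypothetical mini stopword list, much shorter than NLTK's English list
demo_stop_words = {'a', 'of', 'and', 'the', 'take'}

# Made-up tokens for illustration
demo_words = ['take', 'a', 'pound', 'of', 'flour', 'a', 'pound', 'of',
              'sugar', 'and', 'a', 'pint', 'of', 'milk']

# Keep only the words that are not in the stopword set
demo_filtered = [w for w in demo_words if w not in demo_stop_words]
demo_top = Counter(demo_filtered).most_common(2)
print(demo_top)
```

With a long text it may be worth wrapping the NLTK list in `set()`, since checking whether a word is in a set is faster than scanning a list.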

That looks a bit more interesting.

## Displaying a word cloud

Word clouds are a familiar way of visualising word frequencies in a text. The Wordcloud package makes it easy to generate them. Note that Wordcloud takes care of all the tokenisation, counting, and stopword removal for us – we just give it the text and display the results.

In [None]:
# First we make the wordcloud
# See how we feed the text variable into the generate function?
wc = WordCloud(width=600, height=300).generate(text)

# Then we display the wordcloud using matplotlib
plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')

You might notice that the wordcloud includes some two-word phrases (called bigrams) such as 'an hour'. That's the package's default setting, but we can exclude them by setting `collocations=False`.

In [None]:
wc = WordCloud(width=600, height=300, collocations=False).generate(text)
plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')

## Word and n-gram frequencies

To go further in analysing texts, you'll probably want to use a specialised package. There's NLTK, spaCy, and many others. Here we'll use TextBlob, which is probably the simplest to use and understand.

Finding word frequencies with TextBlob is as simple as accessing `blob.word_counts`. However, we probably want to do a bit of filtering to remove some of the noise introduced by OCR. As before, we can also remove common stop words like 'a', 'the', and 'and'.

The cell below only keeps words that are longer than two characters (`len(w) > 2`), contain only letters, spaces, and hyphens (`re.match(r'^[a-zA-Z \-]*$', w)`), and are not in the standard list of stop words.

Once we have our list of words/frequencies we'll use `Counter` to get the 10 most common words.

In [None]:
# First we load our text into TextBlob
blob = TextBlob(text)

In [None]:
# Then we get the word counts, filter them, and display the most common
words = {w: c for w, c in dict(blob.word_counts).items() if len(w) > 2 and re.match(r'^[a-zA-Z \-]*$', w) and w not in stop_words}
Counter(words).most_common(10)
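
To make the three tests in the filter above easier to follow, here's a sketch that applies them one word at a time. The helper name `keep_word`, the tiny `demo_stop_words` set, and the `candidates` list are all assumptions for illustration.

```python
import re

# Hypothetical mini stopword list, standing in for NLTK's English list
demo_stop_words = {'the', 'and', 'with'}

def keep_word(word):
    """Apply the same three tests as the filter above (helper name is illustrative)."""
    long_enough = len(word) > 2
    # Matches strings made up only of letters, spaces, and hyphens
    letters_only = re.match(r'^[a-zA-Z \-]*$', word) is not None
    return long_enough and letters_only and word not in demo_stop_words

# Made-up candidates, including a typical OCR slip ('tbe' for 'the')
candidates = ['flour', 'of', '1lb', 'well-beaten', 'the', 'tbe']
kept = [w for w in candidates if keep_word(w)]
print(kept)
```

An OCR error like 'tbe' passes all three tests, so this kind of filter reduces noise but won't remove it completely.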

If you compare this list with the one above, you'll see they're very similar, but not exactly the same. Why? Most likely it's because of differences in the way words are extracted from the text. Above we used a very simple regular expression to identify words, but TextBlob uses a [more sophisticated](https://textblob.readthedocs.io/en/dev/_modules/textblob/tokenizers.html) set of rules. The extra length and character filters we applied here can also change the results.

TextBlob will also break our text up into multiple-word phrases called *n*-grams, where 'n' represents the number of words. We saw some bigrams in the wordcloud above, but let's find more. To get a list of bigrams we just use `blob.ngrams(2)`. Here are the first ten.

In [None]:
blob.ngrams(2)[:10]

We can then join the separate parts of each n-gram together (using `.join`) and use `Counter` to find the most common phrases.

In [None]:
Counter([' '.join(l) for l in blob.ngrams(2)]).most_common(10)
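
If you're curious about how n-grams are built, the idea can be approximated in plain Python by sliding a window along a list of tokens. This is a rough sketch, not TextBlob's own implementation. The function name `simple_ngrams` and the sample tokens are assumptions for illustration.

```python
from collections import Counter

def simple_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip stops at the shortest slice, so incomplete runs at the end are dropped
    return list(zip(*(tokens[i:] for i in range(n))))

# Made-up tokens for illustration
demo_tokens = 'a pound of flour and a pound of sugar'.split()

demo_bigrams = simple_ngrams(demo_tokens, 2)
bigram_counts = Counter(' '.join(gram) for gram in demo_bigrams)
print(bigram_counts.most_common(2))
```

A list of 9 tokens produces 8 bigrams, because each bigram starts one token further along than the last.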

We can try trigrams as well!

In [None]:
Counter([' '.join(l) for l in blob.ngrams(3)]).most_common(10)

The most common trigrams seem to relate to quantities – not so surprising in a recipe book! Let's focus in on this by looking at 4-grams (4 word phrases) that start with some of the common trigrams.

In [None]:
Counter([' '.join(l) for l in blob.ngrams(4) if ' '.join(l).startswith('a pound of')]).most_common(10)

In [None]:
Counter([' '.join(l) for l in blob.ngrams(4) if ' '.join(l).startswith('a pint of')]).most_common(10)

In [None]:
Counter([' '.join(l) for l in blob.ngrams(4) if ' '.join(l).startswith('a tablespoonful of')]).most_common(10)
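
Another way to look at the same question is to count only the word that follows a phrase. The sketch below does this in plain Python on made-up tokens. The helper name `next_words` and the sample data are assumptions for illustration.

```python
from collections import Counter

# Made-up tokens for illustration
demo_tokens = ('a pound of flour a pound of sugar '
               'a pint of milk a pound of flour').split()

def next_words(tokens, prefix):
    """Count the word that follows each occurrence of prefix (illustrative helper)."""
    prefix_tokens = prefix.split()
    size = len(prefix_tokens)
    following = Counter()
    # Stop early enough that there is always a word after the prefix
    for i in range(len(tokens) - size):
        if tokens[i:i + size] == prefix_tokens:
            following[tokens[i + size]] += 1
    return following

after_pound = next_words(demo_tokens, 'a pound of')
print(after_pound.most_common())
```

Comparing whole tokens rather than using `startswith` on a joined string means a phrase like 'a pound offal' wouldn't be counted as a match for 'a pound of'.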

<div class="alert alert-info"><img src="../images/hhicon.png" width="50px" style="vertical-align: bottom; margin-right: 10px;">A challenge! Using the examples above, can you find the 10 most common 4-grams? What about 5-grams?</div>

## What's next

You can also use TextBlob to extract more complex structures such as parts of speech. For an example, see the [Recipe Generator](recipe-generator.ipynb)!