# Word tokenization and frequencies with NLTK

by Koenraad De Smedt at UiB

---

This notebook introduces [NLTK](https://www.nltk.org/), the Natural Language Toolkit, and shows how to use it to do the following:

1.  *Word-tokenize* a text, i.e. make a list of word tokens extracted from a string (including tokens for punctuation).
2.  Compute the *types*, i.e. the set of unique tokens.
3.  Make a *frequency list* of tokens and find how many times a token occurs in the text.

In later notebooks, these techniques will be applied to larger texts read from the Web.

For those who want to learn more about NLTK, there is an [online book](https://www.nltk.org/book/).

---

Here is an example string for a piece of text.

In [None]:
story = '''Once upon a time, there was a princess called Buttercup. She had a
farm-hand called Westley; whenever she tells him to do something, he always
answers: "As you wish." At first she didn't realize he loves her, but
eventually she realizes it and she loves him too! Westley leaves to seek his
fortune overseas so they can marry. When his ship is attacked by the Dread
Pirate Roberts, who is infamous for never leaving survivors, Westley is
presumed dead...'''

The NLTK module provides several useful functions for manipulating text. See the [documentation](https://www.nltk.org/).

In [None]:
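import nltk
# Download the tokenizer models that word_tokenize relies on.
nltk.download('punkt')
# Newer NLTK releases look for the 'punkt_tab' resource instead.
nltk.download('punkt_tab')
from nltk import word_tokenize, FreqDist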

## Tokenization

NLTK provides the function `word_tokenize`, which extracts all tokens from a string and returns them as a list. This tokenizer is somewhat more sophisticated than the simple tokenizer from an earlier notebook. Hyphenated words such as *farm-hand* are kept together. Punctuation is split off, and tokens for punctuation are included in the list. Contractions are split as well: *didn't* becomes the two tokens *did* and *n't*.

In [None]:
tokens = word_tokenize(story)
tokens
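
To see what "splitting off punctuation" means in practice, the following sketch compares plain whitespace splitting with a rough regular-expression approximation. The regular expression is an assumption made for this illustration only and is not the algorithm that `word_tokenize` uses; the sample sentence and the variable names are also invented here.

```python
import re

sample = "She had a farm-hand called Westley; he always obeys."

# Whitespace splitting leaves punctuation glued to the neighbouring word.
naive_tokens = sample.split()

# Rough approximation: a word (optionally with internal hyphens),
# or a single character that is neither a word character nor whitespace.
pattern = r"\w+(?:-\w+)*|[^\w\s]"
rough_tokens = re.findall(pattern, sample)

print(naive_tokens)
print(rough_tokens)
```

In the first list, `Westley;` is a single token; in the second, the semicolon and the final full stop are tokens of their own, while `farm-hand` stays in one piece.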

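Use `casefold` to convert everything to lowercase before tokenizing. The advantage is that variants such as *She* and *she* are counted as the same type. The disadvantage is that the distinction between proper names and ordinary words (e.g. *Dread* and *dread*) is lost.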

In [None]:
tokens = word_tokenize(story.casefold())
tokens
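
The effect of case folding on the number of types can be seen in a small sketch. The word list below is made up for illustration. The last line shows that `casefold` is somewhat more aggressive than `lower`, which matters for some non-English text.

```python
words = ["She", "SHE", "she", "Westley"]

# After case folding, the three spellings of "she" become one type.
folded = [w.casefold() for w in words]
print(len(set(words)), len(set(folded)))

# casefold() goes further than lower(): German "ß" becomes "ss".
print("Straße".lower(), "Straße".casefold())
```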

Make the vocabulary, i.e. the word types, by computing the set of unique tokens in the text.

In [None]:
vocab = set(tokens)
vocab
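
A set has no fixed order, so the display of the vocabulary may differ from one run to the next. If a stable, readable listing is wanted, the set can be sorted into a list. The sketch below uses a made-up toy token list rather than the story.

```python
toy_tokens = ["she", "loves", "him", "and", "he", "loves", "her", "."]

toy_vocab = set(toy_tokens)        # unique tokens, in arbitrary order
sorted_vocab = sorted(toy_vocab)   # alphabetical list, same on every run

print(sorted_vocab)
```

Note that punctuation tokens sort before letters, so `.` comes first.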

Print the length of the text (the number of tokens) and the size of the vocabulary (the number of types).

In [None]:
print(len(tokens), len(vocab))

Make a list of types with more than four characters.

In [None]:
[word for word in vocab if len(word) > 4]

## Word frequencies

Make a frequency distribution of the tokens. `FreqDist` is a kind of *counter* (it is a subclass of Python's `collections.Counter`), which is a special *dict* in which each token is associated with its frequency.

In [None]:
freq_dist = FreqDist(tokens)
freq_dist

We can find the frequency by using the token as a key.

In [None]:
freq_dist['she']
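
Because a frequency distribution behaves like a counter, looking up a token that never occurs gives 0 instead of raising a `KeyError`. The sketch below uses `collections.Counter` from the standard library as a stand-in for `FreqDist`, on a made-up toy token list; it assumes that `FreqDist` behaves the same way for these lookups.

```python
from collections import Counter

toy_tokens = ["she", "loves", "him", "and", "he", "loves", "her", "."]
toy_counts = Counter(toy_tokens)    # stand-in for FreqDist(toy_tokens)

print(toy_counts["loves"])          # a token that occurs twice
print(toy_counts["pirate"])         # a token that never occurs
print("pirate" in toy_counts)       # the lookup did not add the key
```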

Compute the frequencies sorted in decreasing order. This produces a list of (token, frequency) tuples.

In [None]:
freq_list = freq_dist.most_common()
freq_list
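
When several tokens have the same frequency, it helps to know how they are ordered. The sketch below again uses `collections.Counter` as a stand-in, with made-up data; for a `Counter`, items with equal counts are listed in the order in which they were first encountered. That `FreqDist` orders ties in the same way is an assumption here.

```python
from collections import Counter

toy_counts = Counter(["b", "a", "b", "c", "a", "d"])

# 'b' and 'a' both occur twice; 'b' was seen first, so it is listed first.
print(toy_counts.most_common())

# Passing a number keeps only that many entries from the top.
print(toy_counts.most_common(2))
```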

Print the nine most common tokens with their frequencies.

In [None]:
for token, freq in freq_dist.most_common(9):
    print(freq, ':', token)

### Exercises

1.   Compute the lexical variation, i.e. the ratio of types to tokens (also known as the type-token ratio).
2.   Print the nine most common tokens with their frequencies, but also print the length of each token on the same line.
3.   Instead of printing in the previous exercise, use a comprehension to make a list of triples containing the frequency, the token and the token's length.
4.   Print the five most common tokens with at least three characters. The easiest way is to first use a comprehension that includes only tokens with at least three characters, and then make a frequency distribution of that list.
5.   (optional) For some purposes, one wants a list containing only words, excluding tokens that consist of just punctuation marks. What needs to be done to keep only words? Tip: if you `import string`, you get the variable `string.punctuation`, which contains all ASCII punctuation marks, so you can write a function that checks whether every character of a token is also in `string.punctuation`; if so, the token is not kept.
6.   (optional) NLTK can also provide a list of stopwords, see the cell below. Compute the set of words in `vocab` which are not stopwords. You can use `-` for set difference.

In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')
print(stopwords.words('english')[:12])
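
As a reminder of how set difference works, here is a minimal sketch on made-up data. The tiny stopword set is hypothetical and is not NLTK's list. Note that `stopwords.words('english')` returns a list, so it has to be converted with `set(...)` before `-` can be applied.

```python
toy_vocab = {"she", "loves", "him", "westley", "."}
toy_stopwords = {"she", "him", "the", "a"}   # hypothetical mini stopword list

# Keep the elements of toy_vocab that are not in toy_stopwords.
content_words = toy_vocab - toy_stopwords
print(sorted(content_words))
```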