# A simple word tokenizer

by Koenraad De Smedt at UiB

---
Word *tokenizing* (or *tokenization*) is the process of splitting text into units called *tokens*. Each word is a token. Spaces are disregarded. Punctuation may either be disregarded or be recognized as tokens in its own right.

Tokenization is often a useful first step in working with digital text. At the same time, some *normalization* of the text may be performed. Normalization may include case folding, error correction and *lemmatization* (reducing words to their dictionary form).

Based on a list of tokens, one can easily compute the *set of types*, i.e. unique tokens.

This notebook shows how to:

1.  Define a simple word tokenizer based on a regular expression
2.  Refine the tokenizer depending on what is considered a word
3.  Compute the types, i.e. the set of unique tokens.

---

A simple word tokenizer can be made by splitting a string, using any sequence of non-word characters as a separator.

In [None]:
import re

def word_tokens(text):
    # Split on any run of non-word characters
    return re.split(r'\W+', text)


Let’s make a string with a short example text (based on [a popular film](https://g.co/kgs/g3veEe) script), and test.



In [None]:
story = '''Once upon a time, there was a princess called Buttercup. She
had a farm-hand called Westley; whenever she tells him to do something,
e.g., he always answers: "As you wish." At first she didn't realize he 
loves her...'''

word_tokens(story)

This obtains a list of word occurrences, but it has a few shortcomings:

1.   There is an empty string at the end of the list because the string ends in a separator. Empty strings are, however, easy to remove.
2.   Hyphenated words and abbreviations are split: *farm-hand* becomes *farm* and *hand*, and *e.g.* becomes *e* and *g*. This is not desirable.
3.   Contractions such as *didn't* are split inappropriately, because apostrophes are part of the `\W` category.
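The first shortcoming can be addressed by filtering the result of the split. The following is a minimal sketch; the function name `word_tokens_nonempty` is made up for this illustration and is not used elsewhere in the notebook.

```python
import re

def word_tokens_nonempty(text):
    # Split on runs of non-word characters, then keep only non-empty strings
    return [token for token in re.split(r'\W+', text) if token]

word_tokens_nonempty('As you wish...')  # ['As', 'you', 'wish']
```

This removes empty strings at either end of the list, but it does nothing about the other two shortcomings.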

A slightly better solution is to match sequences of `\w` (letters, digits and the underscore) and a few other characters, using `findall` instead of `split`.

In [None]:
def word_tokens(text):
    # Match runs of word characters, apostrophes (' or ’) and hyphens
    return re.findall(r"[\w'’-]+", text)

tokens = word_tokens(story)
tokens

This is better, but not perfect. It still does not handle abbreviations with periods, for example. It keeps contractions together, but does not distinguish between an apostrophe in a contraction and the same character used as a single quote around words, unless one actually uses different characters for these purposes. There may also be other character ambiguities, depending on the language. Writing a foolproof tokenizer is difficult, in part because there is no simple definition of what a word is.
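As an illustration of how such refinements might look, the sketch below makes two assumptions that are not part of the tokenizer above: an abbreviation is taken to be two or more single ASCII letters each followed by a period (as in *e.g.*), and apostrophes and hyphens count only when they occur between word characters. The function name `word_tokens_refined` and the `sample` string are made up for this example.

```python
import re

def word_tokens_refined(text):
    # First alternative: dotted abbreviations such as "e.g." or "U.S.A."
    # Second alternative: word characters, optionally joined by a
    # word-internal apostrophe or hyphen (didn't, farm-hand)
    pattern = r"(?:[A-Za-z]\.){2,}|\w+(?:['’-]\w+)*"
    return re.findall(pattern, text)

sample = "She said, e.g., 'As you wish' and didn't leave the farm-hand."
word_tokens_refined(sample)
```

On this sample, *e.g.* and *didn't* are each kept as one token, while the single quotes around *As you wish* are dropped. The sketch is still far from foolproof: it misses abbreviations such as *Dr.* or *etc.*, and it strips the apostrophe from plural possessives such as *dogs'*.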

The set of tokens, i.e. the collection of all unique tokens, gives the word *types*.

In [None]:
set(tokens)

### Exercises

1.   Tokenization is often combined with normalization. Use the tokenizer on *casefolded* text.
2.   This simple tokenizer does not make tokens for punctuation. Is this good or bad? 
3.   What would be other linguistically motivated ways of handling *didn't*?
4.   Count the number of tokens.
5.   How would you compute the word *types*, in other words, the set of *different* words in a text?
6.   Using the palindrome test from an earlier notebook, check if a list of tokens is palindromic. Test with sentences like `'Fall leaves as soon as leaves fall.'`
