## Text Normalization
At least three tasks are commonly applied as part of any normalization process:
1. **Tokenizing** (segmenting) words
2. Normalizing word formats
3. Segmenting sentences

This notebook focuses on tokenization.

### 1. Regular expression tokenizers
NLTK's `RegexpTokenizer` (and its function form, `regexp_tokenize`) splits a string into substrings using a regular expression.

In [3]:
from nltk.tokenize import regexp_tokenize
pattern=r'''(?x)     # set flag to allow verbose regexps
\d+%?                # percentages
|\w+[-]*\w+          # words with optional internal hyphens
|[a-zA-Z\.]+         # abbreviations, e.g. U.S.A.
|\$?\d+\.\d+         # currency
|\.\.\.              # ellipsis
|[][.,;"’?!():_‘-]   # these are separate tokens (hyphen placed last so it is a literal, not a range)
'''
text = 'That U.S.A. poster-print costs $12.40...52% and more, and one, two, three!'
regexp_tokenize(text, pattern)

['That',
 'U.S.A.',
 'poster-print',
 'costs',
 '$12.40',
 '...',
 '52%',
 'and',
 'more',
 ',',
 'and',
 'one',
 ',',
 'two',
 ',',
 'three',
 '!']
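
As a point of comparison, here is a minimal standard-library sketch of the same idea using `re.findall`. The pattern is an illustrative variant, not the one from the cell above: it is an assumption of this sketch that trying the currency alternative before the percentage alternative is desirable, so that an unprefixed decimal such as `12.40` stays in one token.

```python
import re

# Illustrative variant of the tokenizer pattern (an assumption, not the
# notebook's original): alternatives are tried left to right, so the more
# specific currency/decimal rule comes before the plain-number rule.
pattern = r'''(?x)        # verbose mode: whitespace and comments are ignored
      \$?\d+\.\d+         # currency or decimals, e.g. $12.40
    | \d+%?               # integers and percentages, e.g. 52%
    | (?:[A-Za-z]\.)+     # abbreviations, e.g. U.S.A.
    | \w+(?:-\w+)*        # words with optional internal hyphens
    | \.\.\.              # ellipsis
    | [][.,;"'?!():_-]    # punctuation as separate tokens
'''

text = 'That U.S.A. poster-print costs $12.40...52% and more, and one, two, three!'
tokens = re.findall(pattern, text)
print(tokens)
```

On the sample sentence this yields the same token list as the output above; the difference only shows up on inputs such as `'about 12.40 each'`, where the decimal is kept whole.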

### 2. Byte-pair encoding (BPE)
BPE is a kind of tokenization in which most tokens are words, but some tokens are frequent morphemes or other subwords such as *-er*, so that an unseen word can be represented by combining known parts.

The intuition of the BPE algorithm is to iteratively merge the most frequent pair of adjacent symbols, starting from individual characters.

In [107]:
import re
import collections

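def get_stats(vocab):
    # Count how often each adjacent symbol pair occurs, weighted by word frequency.
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols)-1):
            pairs[symbols[i], symbols[i+1]] += freq
    return pairs

def merge_vocab(pair, v_in):
    # Replace every occurrence of the pair "a b" with the merged symbol "ab".
    v_out = {}
    bigram = re.escape(' '.join(pair))
    # The lookarounds ensure the match starts and ends on symbol boundaries.
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        w_out = p.sub(''.join(pair), word)
        v_out[w_out] = v_in[word]
    return v_out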

In [111]:
vocab = {'l o w </w>': 5, 'l o w e s t </w>': 2,
        'n e w e r </w>': 6, 'w i d e r </w>': 3, 'n e w </w>': 2}
num_merges = 16

for i in range(num_merges):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print(best)

('e', 'r')
('er', '</w>')
('n', 'e')
('ne', 'w')
('l', 'o')
('lo', 'w')
('new', 'er</w>')
('low', '</w>')
('w', 'i')
('wi', 'd')
('wid', 'er</w>')
('low', 'e')
('lowe', 's')
('lowes', 't')
('lowest', '</w>')
('new', '</w>')
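
The learned merges can then be replayed, in order, on a word that was not in the training vocabulary. The sketch below is a minimal illustration of that step: `apply_merges` is a hypothetical helper (not part of the cells above), and `merges` is copied from the first eight merges printed above.

```python
def apply_merges(word, merges):
    """Segment a word by replaying learned BPE merges in the order they were learned."""
    # Start from single characters plus the end-of-word marker.
    symbols = list(word) + ['</w>']
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            # Merge the pair wherever it occurs as two adjacent symbols.
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# First eight merges from the training run above.
merges = [('e', 'r'), ('er', '</w>'), ('n', 'e'), ('ne', 'w'),
          ('l', 'o'), ('lo', 'w'), ('new', 'er</w>'), ('low', '</w>')]

print(apply_merges('lower', merges))   # ['low', 'er</w>']
print(apply_merges('newer', merges))   # ['newer</w>']
```

The word *lower* never appears in the training vocabulary, yet it is segmented into the known subwords `low` and `er</w>`.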


### 3. Wordpiece and Greedy Tokenization
Like the BPE algorithm, the **wordpiece** algorithm starts with some simple tokenization (such as splitting on whitespace) into rough words, then breaks those rough word tokens into subword tokens. The **wordpiece** model differs from BPE only in that the special word-boundary token `__` appears at the beginning of words rather than at the end, and in the way it merges pairs.

Rather than merging the pair that is most *frequent*, wordpiece merges the pair that maximizes the language model likelihood of the training data. That is, it chooses the two tokens whose combination would give the training corpus the highest probability.
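
To make the contrast with BPE concrete, the sketch below uses one commonly described approximation of a likelihood-based criterion: score each pair by `count(ab) / (count(a) * count(b))`. This scoring rule is an assumption of the sketch, not something the text above specifies, and `wordpiece_scores` is a hypothetical helper name. The toy vocabulary is the one from the BPE section.

```python
import collections

def wordpiece_scores(vocab):
    """Score adjacent pairs by count(ab) / (count(a) * count(b)).

    Assumption: this ratio is only an approximation of the likelihood gain.
    """
    pair_freq = collections.defaultdict(int)
    sym_freq = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for s in symbols:
            sym_freq[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_freq[a, b] += freq
    return {(a, b): f / (sym_freq[a] * sym_freq[b])
            for (a, b), f in pair_freq.items()}

vocab = {'l o w </w>': 5, 'l o w e s t </w>': 2,
         'n e w e r </w>': 6, 'w i d e r </w>': 3, 'n e w </w>': 2}

scores = wordpiece_scores(vocab)
best = max(scores, key=scores.get)
print(best, scores[best])   # ('s', 't') 0.5
```

Under this score the first merge is `('s', 't')`, because `s` and `t` only ever occur together, whereas frequency-based BPE starts with `('e', 'r')`.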

In [71]:
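def maxmatch(string, dictionary, gather):
    # Greedy longest-match: try the longest prefix first, then shorter ones.
    word_end = len(string)
    while word_end > 0:
        firstword = string[:word_end]
        remainder = string[word_end:]
        if firstword in dictionary:
            gather.append(firstword)
            return maxmatch(remainder, dictionary, gather)
        word_end -= 1
    # Either the string is empty or no prefix is in the dictionary; stop here.
    return gather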

In [61]:
dictionary = ["day", "in", "tent", "intent", "happy", "##tent", "##tention", "##tion", "##ion"]

In [72]:
maxmatch('happyday', dictionary, [])

['happy', 'day']

In [73]:
maxmatch('intention', dictionary, [])

['intent']
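
The plain `maxmatch` stops after `intent` because the remainder `ion` is not in the dictionary as a bare string; the `##`-prefixed entries are never consulted. A minimal sketch of a WordPiece-style greedy tokenizer that does use them follows. It assumes the convention that non-initial pieces are stored with a `##` prefix, that a word with no valid segmentation maps to a single unknown token, and that the last dictionary entry is meant to be `##ion`; `wordpiece_tokenize` and the `[UNK]` default are names chosen for this illustration.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation with '##' continuation pieces."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest candidate first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # non-initial pieces carry the prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # Assumption: an unsegmentable word becomes one unknown token.
            return [unk]
        tokens.append(piece)
        start = end
    return tokens

vocab = {"day", "in", "tent", "intent", "happy",
         "##tent", "##tention", "##tion", "##ion"}

print(wordpiece_tokenize("intention", vocab))  # ['intent', '##ion']
print(wordpiece_tokenize("happyday", vocab))   # ['[UNK]']
```

Here `happyday` maps to the unknown token because the vocabulary contains `day` but no continuation piece `##day`.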

### 4. Tokenizer in spaCy
spaCy's tokenizer is rule-based: tokenization is done by applying rules specific to each language. For example, punctuation at the end of a sentence should be split off, whereas “U.K.” should remain one token. Each `Doc` consists of individual tokens, and we can iterate over them:

In [1]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [2]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text)

Apple
is
looking
at
buying
U.K.
startup
for
$
1
billion


First, the raw text is split on whitespace characters, similar to `text.split(' ')`. Then, the tokenizer processes the text from left to right. On each substring, it performs two checks:

1. Does the substring match a tokenizer exception rule? For example, “don’t” does not contain whitespace but should be split into two tokens, `do` and `n’t`, while `U.K.` should always remain one token.
2. Can a prefix, suffix or infix be split off? For example, punctuation such as commas, periods, hyphens or quotes.

If there’s a match, the rule is applied and the tokenizer continues its loop, starting with the newly split substrings. This way, spaCy can split **complex, nested tokens** like combinations of abbreviations and multiple punctuation marks.

<img src="spacy_tokenization.svg" width=400>

- **Tokenizer exception**: special-case rule to split a string into several tokens or prevent a token from being split when punctuation rules are applied.
- **Prefix**: character(s) at the beginning, e.g. `$`, `(`, `“`, `¿`.
- **Suffix**: character(s) at the end, e.g. `km`, `)`, `”`, `!`.
- **Infix**: character(s) in between, e.g. `-`, `--`, `/`, `…`.
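
The loop described above can be sketched in plain Python. This is a toy illustration under stated assumptions, not spaCy's implementation: the rule tables `SPECIAL_CASES`, `PREFIX_RE`, `SUFFIX_RE` and `INFIX_RE` are hypothetical and tiny, whereas spaCy's real rules are language-specific and much larger.

```python
import re

# Hypothetical, minimal rule set for illustration only.
SPECIAL_CASES = {"don't": ["do", "n't"], "U.K.": ["U.K."]}
PREFIX_RE = re.compile(r'^[\[("$]')
SUFFIX_RE = re.compile(r'[\])".,!?]$')
INFIX_RE = re.compile(r'([-/])')

def toy_tokenize(text):
    tokens = []
    for substring in text.split():          # step 1: split on whitespace
        suffixes = []
        while substring:
            # Check 1: exceptions win over punctuation rules.
            if substring in SPECIAL_CASES:
                tokens.extend(SPECIAL_CASES[substring])
                break
            # Check 2a: split off one prefix character and loop again.
            prefix = PREFIX_RE.search(substring)
            if prefix:
                tokens.append(prefix.group())
                substring = substring[prefix.end():]
                continue
            # Check 2b: split off one suffix character and loop again.
            suffix = SUFFIX_RE.search(substring)
            if suffix:
                suffixes.insert(0, suffix.group())
                substring = substring[:suffix.start()]
                continue
            # Check 2c: nothing left to strip, so split on infixes.
            tokens.extend(p for p in INFIX_RE.split(substring) if p)
            break
        tokens.extend(suffixes)
    return tokens

print(toy_tokenize("(don't visit the U.K.!)"))
# ['(', 'do', "n't", 'visit', 'the', 'U.K.', '!', ')']
```

Because the exception check runs before the suffix check on every pass, `U.K.` survives intact even though `.` is a suffix character, while the surrounding `!` and `)` are peeled off one at a time.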