In [2]:
import nltk

## Text Normalization
At least three tasks are commonly applied as part of any normalization process:
1. Tokenizing (segmenting) words
2. Normalizing word formats
3. Segmenting sentences
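
As a rough illustration of these three tasks, here is a minimal sketch that uses only the standard `re` module. The sample text and the regular expressions are assumptions chosen for brevity. In particular, the sentence splitter is naive and would break on abbreviations such as "U.S.A.".

```python
import re

text = "The cat sat. The CATS were sitting! Were they?"

# Sentence segmentation: split on whitespace that follows ., ! or ?
sentences = re.split(r"(?<=[.!?])\s+", text)

# Tokenization: runs of word characters, or single punctuation marks
tokens = [re.findall(r"\w+|[^\w\s]", s) for s in sentences]

# Normalizing word formats: here only case folding
normalized = [[t.lower() for t in sent] for sent in tokens]

print(sentences)
print(normalized)
```

Real pipelines usually go further in the normalization step (for example lemmatization or stemming); case folding is shown only because it needs no extra resources.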

### Tokenization
1. Regular expression tokenizers
A `RegexpTokenizer` splits a string into substrings using a regular expression; `regexp_tokenize` is the function form of the same tokenizer.

In [102]:
from nltk.tokenize import regexp_tokenize
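pattern = r'''(?x)     # verbose mode: whitespace and comments are ignored
\d+%?                  # integers, optionally followed by a percent sign
|\w+[-]*\w+            # words of two or more characters, with optional internal hyphens
|[a-zA-Z\.]+           # letters and periods: abbreviations such as U.S.A. (also runs of dots)
|\$?\d+\.\d+           # decimals and currency amounts such as $12.40
|\.\.\.                # ellipsis
|[][.,;"’?!():_‘-]     # single punctuation marks (hyphen placed last so it is literal, not a range)
'''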
text = 'That U.S.A. poster-print costs $12.40...52% and more, and one, two, three!'
regexp_tokenize(text, pattern)

['That',
 'U.S.A.',
 'poster-print',
 'costs',
 '$12.40',
 '...',
 '52%',
 'and',
 'more',
 ',',
 'and',
 'one',
 ',',
 'two',
 ',',
 'three',
 '!']
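
The order of the alternatives in such a pattern matters, because at each position the regex engine takes the first alternative that matches. The sketch below shows this with plain `re.findall` on a made-up sample string. Both patterns are simplified assumptions, not the pattern used above.

```python
import re

# Same alternatives, different order.
integer_first = re.compile(r"\d+%?|\$?\d+\.\d+|\w+")
decimal_first = re.compile(r"\$?\d+\.\d+|\d+%?|\w+")

sample = "costs 12.40 or $3.50, up 52%"

print(integer_first.findall(sample))
# ['costs', '12', '40', 'or', '$3.50', 'up', '52%']
print(decimal_first.findall(sample))
# ['costs', '12.40', 'or', '$3.50', 'up', '52%']
```

With the integer alternative first, an amount without a dollar sign is cut at the decimal point, so the more specific alternative generally belongs earlier in the pattern.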

## Byte-pair encoding (BPE)
BPE is a kind of tokenization in which most tokens are words, but some tokens are frequent morphemes or other subwords such as *-er*, so an unseen word can be represented by combining known parts.

The intuition of the BPE algorithm is to repeatedly merge the most frequent pair of adjacent symbols (initially single characters) into a new symbol.

In [107]:
import re
import collections

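def get_stats(vocab):
    """Count each adjacent symbol pair, weighted by the frequency of the word it occurs in."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols)-1):
            pairs[symbols[i], symbols[i+1]] += freq
    return pairs

def merge_vocab(pair, v_in):
    """Return a new vocabulary in which every occurrence of `pair` is merged into one symbol."""
    v_out = {}
    bigram = re.escape(' '.join(pair))
    # The lookarounds make the match start and end on whole symbols, never inside a longer one.
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        w_out = p.sub(''.join(pair), word)
        v_out[w_out] = v_in[word]
    return v_out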

In [111]:
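# Each key is a word written as space-separated symbols ending in the end-of-word marker </w>;
# each value is that word's frequency in the corpus.
vocab = {'l o w </w>': 5, 'l o w e s t </w>': 2,
        'n e w e r </w>': 6, 'w i d e r </w>': 3, 'n e w </w>': 2}
num_merges = 16

for i in range(num_merges):
    pairs = get_stats(vocab)
    if not pairs:  # every word is already a single symbol, nothing left to merge
        break
    best = max(pairs, key=pairs.get)  # most frequent pair; ties go to the pair counted first
    vocab = merge_vocab(best, vocab)
    print(best)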

('e', 'r')
('er', '</w>')
('n', 'e')
('ne', 'w')
('l', 'o')
('lo', 'w')
('new', 'er</w>')
('low', '</w>')
('w', 'i')
('wi', 'd')
('wid', 'er</w>')
('low', 'e')
('lowe', 's')
('lowes', 't')
('lowest', '</w>')
('new', '</w>')
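
Once the merges are learned, a new word can be segmented by replaying them in the order they were learned. The sketch below assumes only the first eight merges printed above. The helper name `apply_merges` is hypothetical and not part of the code in this notebook.

```python
def apply_merges(word, merges):
    """Segment `word` by applying learned merges in the order they were learned."""
    symbols = list(word) + ['</w>']  # start from characters plus the end-of-word marker
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            # Merge this pair wherever it occurs as two adjacent symbols.
            if i < len(symbols) - 1 and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# First eight merges from the output above.
merges = [('e', 'r'), ('er', '</w>'), ('n', 'e'), ('ne', 'w'),
          ('l', 'o'), ('lo', 'w'), ('new', 'er</w>'), ('low', '</w>')]

print(apply_merges('newer', merges))  # ['newer</w>']
print(apply_merges('lower', merges))  # ['low', 'er</w>']
```

The word "lower" never appears in the training vocabulary, yet it is covered by the two learned subwords `low` and `er</w>`.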


### Wordpiece and Greedy Tokenization
Like the BPE algorithm, the **wordpiece** algorithm starts with some simple tokenization (such as splitting on whitespace) into rough words, then breaks those rough word tokens into subword tokens. The **wordpiece** model differs from BPE only in that the special word-boundary token __ appears at the beginning of words rather than at the end, and in the way it chooses which pairs to merge.

Rather than merging the pair that is most *frequent*, wordpiece merges the pair that maximizes the language model likelihood of the training data. That is, it chooses the two tokens whose combination would give the training corpus the highest probability.
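
To make the difference concrete, the sketch below compares the two selection rules on the toy vocabulary from the BPE section. It does not train a language model. Instead it uses a score that is often given in descriptions of wordpiece training, `count(pair) / (count(left) * count(right))`, as a stand-in for the likelihood gain. That score is an assumption here, not something stated in the text above, and all names in the block are hypothetical.

```python
from collections import Counter

vocab = {'l o w </w>': 5, 'l o w e s t </w>': 2,
         'n e w e r </w>': 6, 'w i d e r </w>': 3, 'n e w </w>': 2}

symbol_counts = Counter()
pair_counts = Counter()
for word, freq in vocab.items():
    symbols = word.split()
    for s in symbols:
        symbol_counts[s] += freq
    for left, right in zip(symbols, symbols[1:]):
        pair_counts[left, right] += freq

def score(pair):
    # Assumed stand-in for the likelihood gain: high when the pair occurs
    # together far more often than its parts' individual counts would suggest.
    left, right = pair
    return pair_counts[pair] / (symbol_counts[left] * symbol_counts[right])

best_by_frequency = max(pair_counts, key=pair_counts.get)
best_by_score = max(pair_counts, key=score)

print(best_by_frequency)  # ('e', 'r')
print(best_by_score)      # ('s', 't')
```

Under this score the rare but perfectly associated pair `('s', 't')` wins, whereas the frequency rule picks `('e', 'r')`.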

In [113]:
%%writefile max_match_word_segmenter.py
class MaxMatchWordSegmenter:
    """
    Basic max-match implementation for word segmentation using a given dictionary.
    Tends to have very bad results for English.
    """

    def __init__(self, dictionary):
        """
        :param dictionary: object with an is_word(str) method that returns True for every word
            that may occur in the given strings
        """
        self.dictionary = dictionary

    def segment_words(self, string):
        """
        Segments a sentence into words using the max-match algorithm.  This will attempt to greedily find the largest
        words in a sentence, starting at the beginning and moving left-to-right with the remaining string.
        :param string: words without spaces separating them
        :return: list of words that are a word segmentation of the given string
        """
        words = []

        word_begin = 0
        while word_begin < len(string):
            word = self.find_longest_word(string[word_begin:])
            words.append(word)
            word_begin += len(word)

        return words

    def find_longest_word(self, string):
        """
        Finds the longest word that is a prefix of a given string
        :param string: string for which to find the longest word prefix
        :return: longest prefix of the given string, or the first letter of the string if there is no word prefix
        """
        word_end = len(string)
        while word_end > 1:
            test_word = string[:word_end]
            if self.dictionary.is_word(test_word):
                return test_word
            word_end -= 1
        return string[0]

Writing max_match_word_segmenter.py
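
The segmenter expects a dictionary object with an `is_word` method, which the notebook does not define. The sketch below supplies a hypothetical `SetDictionary` wrapper and a compact function that follows the same greedy longest-prefix logic as the class, so that the block runs on its own. The word list is a made-up toy example.

```python
class SetDictionary:
    """Hypothetical wrapper exposing the is_word() method the segmenter calls."""

    def __init__(self, words):
        self._words = set(words)

    def is_word(self, candidate):
        return candidate in self._words


def max_match(text, dictionary):
    """Greedy left-to-right longest-prefix segmentation."""
    tokens = []
    start = 0
    while start < len(text):
        # Try the longest candidate first; a single character is the fallback.
        for end in range(len(text), start, -1):
            if dictionary.is_word(text[start:end]) or end == start + 1:
                tokens.append(text[start:end])
                start = end
                break
    return tokens


toy_dictionary = SetDictionary(
    ["we", "can", "canon", "only", "see", "a", "ash", "short", "distance", "ahead"]
)
result = max_match("wecanonlyseeashortdistanceahead", toy_dictionary)
print(result)
# ['we', 'canon', 'l', 'y', 'see', 'ash', 'o', 'r', 't', 'distance', 'ahead']
```

The greedy choice of "canon" over "can" leaves "ly" unmatched, which illustrates why max-match tends to do badly on English. The same `SetDictionary` instance could also be passed to `MaxMatchWordSegmenter`.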
