# Regular Expressions, Text Normalization, Edit Distance

**text normalization** - converting text to a more convenient, standard form

**lemmatization** - the task of determining that two words have the same root, despite their surface differences
- sang, sung, and sings are forms of the verb sing
- lemmatizer (a function) maps these words to their lemma, sing

**sentence segmentation** - breaking up a text into individual sentences, using cues like periods or exclamation points
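
As a rough illustration of using punctuation cues, the sketch below splits on whitespace that follows `.`, `!`, or `?`. The function name `split_sentences` is made up for this example, and the rule is deliberately naive.

```python
import re

def split_sentences(text: str) -> list[str]:
    # Split at whitespace preceded by ., ! or ?; the lookbehind keeps
    # the punctuation attached to the sentence it ends.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(split_sentences("The box is here. Is it open? Yes!"))
# ['The box is here.', 'Is it open?', 'Yes!']
```

A rule this simple also splits after abbreviations such as "Dr.", which is why practical segmenters need more than punctuation alone.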

**edit distance** - metric that measures how similar two strings are based on the number of edits (insertions, deletions, substitutions) it takes to change one string into the other
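
A minimal sketch of this metric, assuming every insertion, deletion, and substitution costs 1 (plain Levenshtein distance; some presentations charge 2 for a substitution, which changes the numbers). The function name `edit_distance` is chosen for this example.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance with unit cost for insert, delete, substitute."""
    # prev[j] = distance between the source prefix processed so far
    # and the first j characters of target.
    prev = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        curr = [i]  # turning source[:i] into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            sub_cost = 0 if s_char == t_char else 1
            curr.append(min(
                prev[j] + 1,             # delete s_char
                curr[j - 1] + 1,         # insert t_char
                prev[j - 1] + sub_cost,  # substitute, or match for free
            ))
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))       # 3
print(edit_distance("intention", "execution"))  # 5
```

Only two rows of the dynamic-programming table are kept at a time, which is enough when just the distance (not the alignment) is needed.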

### regular expressions

a language for specifying text search strings

**quick regex review:**
 
disjunction
- `[wW]oodchuck` - Woodchuck or woodchuck
- `[abc]`  - ‘a’, ‘b’, or ‘c’
- `gupp(y|ies)` - guppy or guppies

range and `^` as *negation*
- `[0-9]`  - a single digit 0-9
- `[^A-Z]` - not an upper case letter

optional elements: `?`
- `colou?r` - color or colour

kleene star - zero or more occurrences of the immediately previous character or regular expression
- `[ab]*` - aaaa, ababab, bbbb

wildcard `.`
- `beg.n` - begin, beg’n, begun

anchors
- `^The box\.$` - a line that contains only the phrase `The box.`
- `\bthe\b` - the word `the` (but not the `the` inside `other`)
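
The patterns in this review can be checked with Python's `re` module. This is a small sketch; the test strings are chosen for illustration.

```python
import re

# (pattern, text, whether re.search should find a match)
checks = [
    (r"[wW]oodchuck", "a woodchuck", True),
    (r"gupp(y|ies)", "guppies", True),
    (r"[^A-Z]", "ABC", False),        # every character is upper case
    (r"colou?r", "color", True),
    (r"beg.n", "begun", True),
    (r"^The box\.$", "The box.", True),
    (r"\bthe\b", "other", False),     # "the" is not a whole word here
    (r"\bthe\b", "in the box", True),
]

for pattern, text, expected in checks:
    found = re.search(pattern, text) is not None
    print(f"{pattern!r:16} {text!r:14} match={found}")
    assert found == expected

# The Kleene star also matches zero occurrences, so fullmatch is the
# informative check: the whole string must consist of a's and b's.
print(re.fullmatch(r"[ab]*", "ababab") is not None)  # True
```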

### words

**corpus** - a computer-readable collection of text or speech

**utterance** - the spoken correlate of a sentence

*I do uh main- mainly business data processing*

- disfluencies occur in spoken sentences
    - uh and um are called fillers
    - sometimes these are helpful because they may signal the restart of a clause or idea

**lemma** - a set of lexical forms having the same stem, the same major part-of-speech, and the same word sense
- box, boxes


### Text Normalization

1. Tokenizing (segmenting) words
2. Normalizing word formats
3. Segmenting sentences

unix example of tokenizing a text file

`tr -sc 'A-Za-z' '\n' < sh.txt | tr A-Z a-z | sort | uniq -c | sort -n -r`

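- `tr -sc 'A-Za-z' '\n'` changes every sequence of nonalphabetic characters to a newline
    - `-c` complements the set, so it matches the non-alphabetic characters
    - `-s` squeezes each run of repeated newlines into a single one
- `tr A-Z a-z` maps upper case to lower case
- `sort | uniq -c` groups identical words and prefixes each with its count
- `sort -n -r` orders the counts
    - `-n` sorts numerically rather than alphabetically
    - `-r` sorts in reverse order, so the highest counts come first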

result:
    
    27378 the
    26084 and
    22538 i
    19771 to
    17481 of
    14725 a
    13826 you
    12489 my
    11318 that
    11112 in
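
For comparison, here is a rough Python equivalent of the same pipeline. It is a sketch, not a drop-in replacement: the sample string stands in for `sh.txt`, `word_counts` is a name chosen for this example, and the order of words with equal counts may differ from `sort`.

```python
import re
from collections import Counter

def word_counts(text: str) -> Counter:
    # Like tr -sc 'A-Za-z' '\n': every run of non-letters is a separator.
    words = re.findall(r"[A-Za-z]+", text)
    # Like tr A-Z a-z, then sort | uniq -c: fold case and count.
    return Counter(word.lower() for word in words)

sample = "The cat and the hat. The END, and that's that!"
# Like sort -n -r: highest counts first.
for word, count in word_counts(sample).most_common(3):
    print(f"{count:7d} {word}")
```

Note that "that's" becomes two tokens, `that` and `s`, exactly as it would in the `tr` version, which is one reason letter-only tokenization is a crude baseline.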
    

**function words** - articles, pronouns, prepositions; typically the most frequent words in a corpus

**named entity detection** - the task of detecting names, dates, and organizations 


**morpheme** - the smallest meaning-bearing unit of a language

## tokenization
What's the point of tokenization?

- it’s helpful to have subword tokens to deal with unknown words
- ML systems learn facts about words in a training corpus and then use that to make decisions about a test corpus
- if our training corpus contains, say, the words low and lowest but not lower, and the word "lower" then appears in our test corpus, a system with subword tokens can combine tokens learned from the training corpus to represent "lower"
- most tokens are words, but some tokens are frequent morphemes or other subwords like -er, so that an unseen word can be represented by combining the parts
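
To make the "combine the parts" idea concrete, the sketch below splits an unseen word into known subwords using greedy longest-match-first. Both the `segment` function and the tiny subword set are hypothetical, and this matching strategy is an illustration only; it is not how byte-pair encoding applies its merges.

```python
def segment(word: str, subwords: set[str]) -> list[str]:
    """Greedy longest-match-first split of word into known subwords."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining candidate first.
        for end in range(len(word), start, -1):
            if word[start:end] in subwords:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known subword starts here: fall back to one character.
            pieces.append(word[start])
            start += 1
    return pieces

# Hypothetical subword vocabulary: whole words plus frequent suffixes.
subwords = {"low", "new", "est", "er"}
print(segment("lower", subwords))  # ['low', 'er']
```

"lower" never appears in the vocabulary, yet it is fully covered by two pieces that do.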


### byte-pair encoding for tokenization

based on a method for text compression; the intuition of the algorithm is to iteratively merge the most frequent pairs of adjacent symbols

algorithm:

>1. Initialize the vocabulary with the set of all characters seen in the corpus plus a special end-of-word symbol (written "_" or `</w>`).
>2. Represent each word in the corpus as the sequence of its characters followed by the end-of-word symbol `</w>`.
>3. Count every pair of adjacent symbols across all words, weighting each pair by the frequency of the word it occurs in.
>4. Merge every occurrence of the most frequent pair and add the new merged symbol to the vocabulary.
>5. Repeat steps 3 and 4 until the desired number of merge operations is completed or the desired vocabulary size is reached (either limit is a hyperparameter).
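
The steps above can be traced on a corpus small enough to check by hand. This is a compact sketch under a few assumptions: the toy word frequencies are invented, words are stored as tuples of symbols, and ties between equally frequent pairs go to the pair seen first.

```python
from collections import Counter

def learn_bpe(word_freqs: dict[str, int], num_merges: int) -> list[tuple[str, str]]:
    # Steps 1-2: each word becomes its characters plus the end-of-word symbol.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Step 3: count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[left, right] += freq
        if not pairs:
            break
        # Step 4: merge every occurrence of the most frequent pair.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

toy = {"low": 5, "lowest": 2, "newer": 6, "wider": 3, "new": 2}
print(learn_bpe(toy, 3))
# [('e', 'r'), ('er', '</w>'), ('n', 'e')]
```

The first merge is `e` + `r` because that pair occurs 9 times (6 in "newer", 3 in "wider"), more than any other pair in the toy corpus.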



# Byte-Pair Tokenizing A Midsummer Night's Dream

my implementation of the byte-pair encoding algorithm for tokenization
- inputs: dictionary of words with frequencies of those words
- K: number of iterations to run the algorithm

In [62]:
import re
with open('midsummer-nights-dream.txt', 'r') as file:
    corpus = file.read()
corpus = corpus.replace('\n', ' ').replace('\r', '')
print(corpus)



In [54]:
import re
from collections import Counter, defaultdict

In [55]:
def build_vocab(corpus):
    '''
    Split each whitespace-separated word into characters, append the
    end-of-word symbol </w>, and count how often each word occurs
    '''
    words = [" ".join(word) + " </w>" for word in corpus.split()]

    # Count the frequency of each (space-separated) word in the corpus
    vocab = Counter(words)

    return vocab

In [56]:
vocab = build_vocab(corpus)
print(vocab)

Counter({'t h e </w>': 482, 'I </w>': 412, 'a n d </w>': 365, 't o </w>': 270, 'o f </w>': 248, 'a </w>': 233, 'i n </w>': 205, 'A n d </w>': 184, 'y o u </w>': 179, 'm y </w>': 176, 'i s </w>': 163, 'w i t h </w>': 147, 'n o t </w>': 141, 't h a t </w>': 132, 'y o u r </w>': 115, 't h i s </w>': 109, 'f o r </w>': 101, 'm e </w>': 101, 'w i l l </w>': 100, 'a s </w>': 96, 'i t </w>': 95, 'h a v e </w>': 91, 't h o u </w>': 90, 'd o </w>': 89, 'b e </w>': 83, 'h i s </w>': 82, 'h e r </w>': 81, 'T h e </w>': 77, 'a l l </w>': 76, 'h e </w>': 73, 's h a l l </w>': 65, 'w e </w>': 62, 's o </w>': 60, 'a r e </w>': 60, 'T o </w>': 59, 'o n </w>': 58, 't h y </w>': 58, 'b u t </w>': 58, 'n o </w>': 57, 'L Y S A N D E R </w>': 56, 'D E M E T R I U S </w>': 55, 'H E R M I A </w>': 55, 'o u r </w>': 54, 'l o v e </w>': 54, 'b y </w>': 52, 'a m </w>': 52, 'B O T T O M </w>': 52, 's h e </w>': 50, 'f r o m </w>': 50, 't h e i r </w>': 49, 'T H E S E U S </w>': 48, 'B u t </w>': 45, 'T h a t </w

In [57]:
def get_pair_counts(vocab):
    '''
    Get counts of the pairs of all consecutive symbols
    '''

    pair_counts = defaultdict(int)
    for word, frequency in vocab.items():
        chars = word.split()

        for i in range(len(chars) - 1):
            pair_counts[chars[i], chars[i + 1]] += frequency

    return pair_counts

In [58]:
pair_counts = get_pair_counts(vocab)
print(pair_counts)

defaultdict(<class 'int'>, {('\ufeff', 'A'): 1, ('A', '</w>'): 168, ('M', 'i'): 7, ('i', 'd'): 138, ('d', 's'): 102, ('s', 'u'): 68, ('u', 'm'): 29, ('m', 'm'): 17, ('m', 'e'): 583, ('e', 'r'): 1143, ('r', '</w>'): 995, ('N', 'i'): 10, ('i', 'g'): 179, ('g', 'h'): 253, ('h', 't'): 167, ('t', "'"): 16, ("'", 's'): 144, ('s', '</w>'): 1342, ('D', 'r'): 4, ('r', 'e'): 773, ('e', 'a'): 537, ('a', 'm'): 220, ('m', '</w>'): 214, ('S', 'h'): 19, ('h', 'a'): 712, ('a', 'k'): 170, ('k', 'e'): 199, ('e', 's'): 557, ('s', 'p'): 124, ('p', 'e'): 156, ('a', 'r'): 530, ('e', '</w>'): 2472, ('h', 'o'): 380, ('o', 'm'): 303, ('e', 'p'): 88, ('p', 'a'): 135, ('a', 'g'): 73, ('g', 'e'): 141, ('|', '</w>'): 2, ('E', 'n'): 40, ('n', 't'): 340, ('t', 'i'): 167, ('i', 'r'): 227, ('p', 'l'): 111, ('l', 'a'): 193, ('a', 'y'): 248, ('y', '</w>'): 803, ('A', 'C'): 5, ('C', 'T'): 5, ('T', '</w>'): 14, ('I', '</w>'): 417, ('S', 'C'): 9, ('C', 'E'): 55, ('E', 'N'): 53, ('N', 'E'): 9, ('E', '</w>'): 68, ('I', '.'):

In [59]:
most_frequent = max(pair_counts, key=pair_counts.get)
print(most_frequent)

('e', '</w>')


In [60]:
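def merge_most_frequent(pair: tuple, v_in: dict) -> dict:
    '''
    Merge all occurrences of the most frequent pair
    '''

    v_out = {}
    bigram = re.escape(' '.join(pair))
    # Match the pair only as two whole symbols, bounded by spaces or string edges
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    merged = ''.join(pair)

    for word in v_in:
        # replace the pair in every word of the vocabulary;
        # a function replacement keeps any backslash in `merged` literal
        w_out = p.sub(lambda match: merged, word)
        v_out[w_out] = v_in[word]

    return v_out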

In [47]:
vocab = merge_most_frequent(most_frequent, vocab)
print(vocab)
# you can see the 'e</w>'s being merged in the following vocabulary

{'\ufeff A </w>': 1, 'M i d s u m m e r </w>': 2, "N i g h t ' s </w>": 2, 'D r e a m </w>': 2, 'S h a k e s p e a r e</w>': 1, 'h o m e p a g e</w>': 1, '| </w>': 2, 'E n t i r e</w>': 1, 'p l a y </w>': 27, 'A C T </w>': 5, 'I </w>': 412, 'S C E N E </w>': 9, 'I . </w>': 8, 'A t h e n s . </w>': 6, 'T h e</w>': 77, 'p a l a c e</w>': 3, 'o f </w>': 248, 'T H E S E U S . </w>': 2, 'E n t e r </w>': 34, 'T H E S E U S , </w>': 4, 'H I P P O L Y T A , </w>': 4, 'P H I L O S T R A T E , </w>': 2, 'a n d </w>': 365, 'A t t e n d a n t s </w>': 2, 'T H E S E U S </w>': 48, 'N o w , </w>': 6, 'f a i r </w>': 23, 'H i p p o l y t a , </w>': 3, 'o u r </w>': 54, 'n u p t i a l </w>': 3, 'h o u r </w>': 2, 'D r a w s </w>': 1, 'o n </w>': 58, 'a p a c e ; </w>': 1, 'f o u r </w>': 1, 'h a p p y </w>': 5, 'd a y s </w>': 3, 'b r i n g </w>': 8, 'i n </w>': 205, 'A n o t h e r </w>': 3, 'm o o n : </w>': 2, 'b u t , </w>': 1, 'O , </w>': 16, 'm e t h i n k s , </w>': 2, 'h o w </w>': 16, 's l o 

In [52]:
# now run this K more times
K = 3000
for i in range(K):
    pair_counts = get_pair_counts(vocab)

    if not pair_counts:
        break

    most_frequent = max(pair_counts, key=pair_counts.get)
    vocab = merge_most_frequent(most_frequent, vocab)

    
print(vocab)

{'\ufeffA</w>': 1, 'Midsummer</w>': 2, "Night's</w>": 2, 'Dream</w>': 2, 'Shakespeare</w>': 1, 'homepage</w>': 1, '|</w>': 2, 'Entire</w>': 1, 'play</w>': 27, 'ACT</w>': 5, 'I</w>': 412, 'SCENE</w>': 9, 'I.</w>': 8, 'Athens.</w>': 6, 'The</w>': 77, 'palace</w>': 3, 'of</w>': 248, 'THESEUS.</w>': 2, 'Enter</w>': 34, 'THESEUS,</w>': 4, 'HIPPOLYTA,</w>': 4, 'PHILOSTRATE,</w>': 2, 'and</w>': 365, 'Attendants</w>': 2, 'THESEUS</w>': 48, 'Now,</w>': 6, 'fair</w>': 23, 'Hippolyta,</w>': 3, 'our</w>': 54, 'nuptial</w>': 3, 'hour</w>': 2, 'Draws</w>': 1, 'on</w>': 58, 'apace;</w>': 1, 'four</w>': 1, 'happy</w>': 5, 'days</w>': 3, 'bring</w>': 8, 'in</w>': 205, 'Another</w>': 3, 'moon:</w>': 2, 'but,</w>': 1, 'O,</w>': 16, 'methinks,</w>': 2, 'how</w>': 16, 'slow</w>': 2, 'This</w>': 33, 'old</w>': 6, 'moon</w>': 10, 'wanes!</w>': 1, 'she</w>': 50, 'lingers</w>': 1, 'my</w>': 176, 'desires,</w>': 1, 'Like</w>': 5, 'to</w>': 270, 'a</w>': 233, 'step-dame</w>': 1, 'or</w>': 42, 'dowager</w>': 2, '

In [16]:
def extract_tokens_from_vocab(vocab):
    '''
    Print the vocabulary sorted by word frequency, lowest first
    '''
    vocab = dict(sorted(vocab.items(), key=lambda item: item[1]))

    print(vocab)

In [51]:
extract_tokens_from_vocab(vocab)

{'\ufeff A</w>': 1, 'Sha k esp ear e</w>': 1, 'h ome page</w>': 1, 'En ti re</w>': 1, 'D raw s</w>': 1, 'a pa ce;</w>': 1, 'f our</w>': 1, 'bu t,</w>': 1, 'wan es!</w>': 1, 'l ingers</w>': 1, 'desir es,</w>': 1, 'st e p -da me</w>': 1, 'L ong</w>': 1, 'revenu e.</w>': 1, 'st eep</w>': 1, 'nigh ts</w>': 1, 'ti me;</w>': 1, 'b ow</w>': 1, 'New -b ent</w>': 1, 'heav en,</w>': 1, 'solemn i ti es.</w>': 1, 'Philostra te,</w>': 1, 'S ti r</w>': 1, 'merriment s;</w>': 1, 'A wake</w>': 1, 'per t</w>': 1, 'n im ble</w>': 1, 'mir th;</w>': 1, 'T urn</w>': 1, 'mel an ch ol y</w>': 1, 'fun er al s;</w>': 1, 'compani on</w>': 1, 'pom p.</w>': 1, "woo 'd</w>": 1, 'do ing</w>': 1, 'in juri es;</w>': 1, 'pom p,</w>': 1, 'triumph </w>': 1, 'rev ell ing.</w>': 1, 'r en owned</w>': 1, 'duk e!</w>': 1, 'Ege us:</w>': 1, "w hat's</w>": 1, 'new s</w>': 1, 'com plain t</w>': 1, "be wi tch'd</w>": 1, 'Th ou,</w>': 1, 'r hym es,</w>': 1, 'inter chang ed</w>': 1, 'love- to k ens</w>': 1, 'wind ow</w>': 1, 'sun 
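
The printed vocabulary still groups symbols by word. One way to get the actual token inventory is to count each space-separated symbol, weighted by its word's frequency. This is a sketch only: `token_counts` is a hypothetical helper, and the sample entries are copied from the output above.

```python
from collections import Counter

def token_counts(vocab: dict[str, int]) -> Counter:
    """Count each space-separated symbol, weighted by its word's frequency."""
    counts = Counter()
    for word, freq in vocab.items():
        for token in word.split():
            counts[token] += freq
    return counts

# A few entries copied from the printed vocabulary.
sample_vocab = {
    "h ome page</w>": 1,
    "En ti re</w>": 1,
    "ti me;</w>": 1,
    "S ti r</w>": 1,
}
print(token_counts(sample_vocab).most_common(1))  # [('ti', 3)]
```

Running the same function on the full `vocab` would give the learned subword tokens and how often each is used.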

**References**
1. [Speech and Language Processing - Chapter 2](https://web.stanford.edu/~jurafsky/slp3/2.pdf)
