## Regular Expressions, Text Normalization, Edit Distance

**text normalization** - converting text to a more convenient, standard form

**lemmatization** - the task of determining that two words have the same root, despite their surface differences
- sang, sung, and sings are forms of the verb sing
- lemmatizer (a function) maps these words to their lemma, sing

**sentence segmentation** - breaking up a text into individual sentences, using cues like periods or exclamation points
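
A minimal sketch of punctuation-based segmentation, assuming a sentence ends at `.`, `!`, or `?` followed by whitespace. The function name `split_sentences` and the sample text are made up for illustration; this is a naive baseline, not a full segmenter.

```python
import re

def split_sentences(text: str) -> list:
    # Split at whitespace that directly follows ., !, or ?
    # (assumption: those three marks are the only sentence boundaries).
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

print(split_sentences("It works! Does it? Mostly. Dr. Smith disagrees."))
```

The last sentence comes back as two pieces (`Dr.` and `Smith disagrees.`), which shows why a period alone is an ambiguous cue: it also marks abbreviations.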

**edit distance** - metric that measures how similar two strings are based on the number of edits (insertions, deletions, substitutions) it takes to change one string into the other
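
A sketch of how this metric could be computed with dynamic programming, assuming every insertion and deletion costs 1. The function name `edit_distance` and the `sub_cost` parameter are illustrative choices, not something these notes define.

```python
def edit_distance(source: str, target: str, sub_cost: int = 1) -> int:
    """Minimum total cost of insertions, deletions, and substitutions."""
    n, m = len(source), len(target)
    # dist[i][j] = cost of turning source[:i] into target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i  # delete every character of source[:i]
    for j in range(1, m + 1):
        dist[0][j] = j  # insert every character of target[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + 1,                              # deletion
                dist[i][j - 1] + 1,                              # insertion
                dist[i - 1][j - 1] + (0 if same else sub_cost),  # substitution or match
            )
    return dist[n][m]

print(edit_distance("kitten", "sitting"))  # 3
```

Passing `sub_cost=2` treats a substitution as one deletion plus one insertion, a common variant of the same recurrence.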

### regular expressions

a language for specifying text search strings

**quick regex review:**
 
disjunction
- `[wW]oodchuck` - Woodchuck or woodchuck
- `[abc]`  - ‘a’, ‘b’, or ‘c’
- `gupp(y|ies)` - guppy or guppies

range and `^` as *negation*
- `[0-9]`  - a single digit 0-9
- `[^A-Z]` - not an upper case letter

optional elements: `?`
- `colou?r` - color or colour

Kleene star `*` - zero or more occurrences of the immediately previous character or regular expression
- `[ab]*` - aaaa, ababab, bbbb (and also the empty string)

wildcard `.`
- `beg.n`: begin, beg’n, begun

anchors
- `^The box\.$` - a line that contains only the phrase `The box.`
- `\bthe\b` - `the` (but not the `the` inside the word `other`)
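
The patterns in this review can be tried with Python's `re` module. This is only a sketch: the sample strings are made up, and the guppy pattern uses a non-capturing group `(?:...)` so that `re.findall` returns whole matches rather than just the group.

```python
import re

# (pattern, sample text) pairs; the sample texts are illustrative only.
examples = [
    (r"[wW]oodchuck", "Woodchuck or woodchuck"),  # disjunction with brackets
    (r"gupp(?:y|ies)", "guppy and guppies"),      # disjunction with |
    (r"[^A-Z]", "Oyfn"),                          # negated range
    (r"colou?r", "color colour"),                 # optional element
    (r"beg.n", "begin begun beg'n"),              # wildcard
    (r"\bthe\b", "the other one"),                # word boundaries
]

results = {pattern: re.findall(pattern, text) for pattern, text in examples}
for pattern, matches in results.items():
    print(pattern, matches)

# Anchors: ^ and $ force the pattern to cover the whole line.
whole_line = re.search(r"^The box\.$", "The box.") is not None
print(whole_line)
```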

### words

**corpus** - a computer-readable collection of text or speech

**utterance** - the spoken correlate of a sentence

*I do uh main- mainly business data processing*

- disfluencies occur in spoken sentences
    - uh and um are called fillers
    - sometimes these are helpful because they may signal the restart of a clause or idea

**lemma** - a set of lexical forms having the same stem, the same major part-of-speech, and the same word sense
- box, boxes


### Text Normalization

1. Tokenizing (segmenting) words
2. Normalizing word formats
3. Segmenting sentences

unix example of tokenizing a text file

`tr -sc 'A-Za-z' '\n' < sh.txt | tr A-Z a-z | sort | uniq -c | sort -n -r`

- changes every sequence of nonalphabetic characters to a newline
- \-c option complements the character set, so `A-Za-z` becomes "every non-alphabetic character"
- \-s option squeezes each run of repeated output characters into a single character
- \-n option sorts numerically rather than alphabetically
- \-r option means to sort in reverse order

result:
```
27378 the
26084 and
22538 i
19771 to
17481 of
14725 a
13826 you
12489 my
11318 that
11112 in
```
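
A rough Python equivalent of the pipeline, as a sketch under the same assumption that a token is any run of ASCII letters. The function name `word_counts` and the sample string are hypothetical; the pipeline itself reads `sh.txt`.

```python
import re
from collections import Counter

def word_counts(text: str) -> Counter:
    # Runs of ASCII letters are tokens (like tr -sc 'A-Za-z' '\n'),
    # then everything is lowercased (like tr A-Z a-z).
    tokens = re.findall(r"[A-Za-z]+", text)
    return Counter(token.lower() for token in tokens)

sample = "The cat and the hat. The END, and that's that!"
# most_common sorts by descending count, like sort -n -r after uniq -c.
for word, count in word_counts(sample).most_common(3):
    print(count, word)
```

As with the `tr` version, the apostrophe in `that's` splits the word into `that` and `s`.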

**function words** - articles, pronouns, prepositions; typically the most frequent words in a corpus

**named entity detection** - the task of detecting names, dates, and organizations 


**morpheme** - the smallest meaning-bearing unit of a language

ML systems learn facts about words in a training corpus and then use that to make decisions about a test corpus 

### byte-pair encoding for tokenization

based on a method for text compression, the intuition of the algorithm is to iteratively merge frequent pairs of characters

algorithm:
1. initialize a "vocabulary" with the set of symbols equal to the set of characters plus an end-of-word marker "_"
2. represent each word in the dictionary as a sequence of symbols from the vocabulary
3. count how often each pair of adjacent symbols occurs in the current dictionary
4. find the most frequent symbol pair
5. add the merged symbol to our vocabulary
6. merge that symbol pair across the dictionary
7. repeat #3-#6 K times
8. the resulting vocabulary will consist of the original set of characters plus K new symbols



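```python
# Toy dictionary: word -> frequency in the corpus
d = {"low": 5, "lowest": 2, "newer": 6, "wider": 3, "new": 2}

def byte_pair_tokenize(D, K):
    # Initialization only (steps 1-2): the K merge iterations are not implemented here.
    vocab = set()
    vocab.add("_")  # end-of-word marker
    merged = {}
    for key, val in D.items():
        vocab.update(list(key))   # every character becomes a vocabulary symbol
        merged[key] = list(key)   # each word starts as a list of single characters

    print(merged)
    print(vocab)

byte_pair_tokenize(d, 1)
```

result:
```
{'low': ['l', 'o', 'w'], 'lowest': ['l', 'o', 'w', 'e', 's', 't'], 'newer': ['n', 'e', 'w', 'e', 'r'], 'wider': ['w', 'i', 'd', 'e', 'r'], 'new': ['n', 'e', 'w']}
{'s', 'd', 't', 'o', 'i', 'r', '_', 'l', 'n', 'w', 'e'}
```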
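
The function above stops after initialization. One way the remaining steps (#3-#6) could look is sketched below. It assumes that "_" is appended to every word as an end-of-word marker, that pair counts are weighted by word frequency, and that ties go to the pair seen first; the name `learn_bpe` and its return values are hypothetical.

```python
from collections import Counter

def learn_bpe(word_freqs: dict, k: int):
    # Steps 1-2: each word becomes a tuple of characters plus the "_" marker.
    corpus = {tuple(word) + ("_",): freq for word, freq in word_freqs.items()}
    vocab = {symbol for symbols in corpus for symbol in symbols}
    merges = []
    for step in range(k):
        # Step 3: count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Step 4: most frequent pair (ties go to the pair counted first).
        best = max(pairs, key=pairs.get)
        # Step 5: add the merged symbol to the vocabulary.
        merges.append(best)
        vocab.add("".join(best))
        # Step 6: apply the merge to every word in the dictionary.
        new_corpus = {}
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] = freq
        corpus = new_corpus
    return vocab, merges

d = {"low": 5, "lowest": 2, "newer": 6, "wider": 3, "new": 2}
vocab, merges = learn_bpe(d, 3)
print(merges)
```

With this toy dictionary and K = 3, the learned merges are `er`, `er_`, and `ne`, so the vocabulary grows from 11 symbols to 14.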
