# Regular Expressions, Text Normalization, and Edit Distance

## Regular Expressions

### The Basics

Formally, a __regular expression__ is an algebraic notation for characterizing a set of strings. A regular expression specifies a _pattern_ to search for in a _corpus_ of texts. The Unix tool `grep` takes a regex and returns every line of the corpus that matches the expression.
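As a rough illustration of the line-by-line search that `grep` performs, here is a minimal sketch using Python's `re` module. The `grep` function and the three-line `corpus` are made up for this example and only mimic the basic behavior, not the real tool's options.

```python
import re

def grep(pattern, lines):
    """Return every line that contains a match for the pattern (hypothetical helper)."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

# A tiny stand-in corpus, invented for illustration.
corpus = [
    "The woodchuck dug a burrow.",
    "Woodchucks are also called groundhogs.",
    "No rodents here.",
]

# Matching is case sensitive, so the capitalized "Woodchucks" line is not returned.
matches = grep(r"woodchuck", corpus)
```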

Regex is __case sensitive__. Square brackets express a _disjunction_ of characters: `[ab]` matches either `a` or `b`. Inside brackets, the dash (`-`) specifies a range, as in `[a-z]`, and a caret `^` placed first expresses _negation_, as in `[^a-z]`. The question mark `?` denotes zero or one instances of the preceding expression, the __Kleene__ `*` denotes zero or more, and the __Kleene__ `+` denotes one or more. The __wildcard__ expression `.` matches _any_ single character except a carriage return. `.` is often combined with `*` or `+`, as in `.*`, to mean "any string of characters."
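The operators above can be tried out directly. The following is a small sketch using Python's `re` module; the sample strings and variable names are invented for illustration, and Python writes patterns as raw strings rather than between slashes.

```python
import re

text = "Woodchucks and a woodchuck saw colour, color, and baaa!"

# Brackets: a disjunction of characters, here covering either capitalization.
either_case = re.findall(r"[wW]oodchuck", text)

# ? makes the preceding character optional (zero or one instance).
optional_u = re.findall(r"colou?r", text)

# + requires one or more of the preceding character.
one_or_more = re.findall(r"ba+", text)

# * allows zero or more, so a bare "b" also matches.
zero_or_more = re.findall(r"ba*", "b ba baaa")

# A range inside brackets, and its negation with a leading caret.
digits = re.findall(r"[0-9]", "Chapter 2, page 14")
non_lowercase = re.findall(r"[^a-z]", "abC9!")

# The wildcard matches any single character in that position.
wildcard = re.findall(r"beg.n", "begin began begun")
```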

__Anchors__ signify particular places in a string. Outside of brackets, `^` matches the start of a line, and `$` matches the end of a line. Additionally, `\b` matches a word boundary and `\B` matches a non-boundary.
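A short sketch of the four anchors, again assuming Python's `re` module. The example lines and the string `sample` are invented; match positions are recorded as character offsets so the effect of `\b` and `\B` is visible.

```python
import re

lines = ["The dog barked.", "See the other one.", "the end"]

# ^ anchors the match to the start of the line; \b requires the word to end there.
starts_with_the = [ln for ln in lines if re.search(r"^[Tt]he\b", ln)]

# $ anchors the match to the end of the line (here: a final period).
ends_with_period = [ln for ln in lines if re.search(r"\.$", ln)]

sample = "the other there"

# \b...\b: "the" only as a whole word.
whole_word_starts = [m.start() for m in re.finditer(r"\bthe\b", sample)]

# \B...\B: "the" only strictly inside a longer word, as in "other".
inside_word_starts = [m.start() for m in re.finditer(r"\Bthe\B", sample)]
```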

The __disjunction__ operator `|` signifies "either / or." Parentheses can be used to control the _precedence_ of operators. In the pattern `/gupp(y|ies)/`, the parentheses make the disjunction apply only to the suffixes, so the pattern matches _guppy_ or _guppies_. Operators that match _as much_ as possible are __greedy__, while those that match as little as possible are __non-greedy__.
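The effect of grouping and of greediness is easiest to see side by side. This sketch assumes Python's `re` syntax, where `(?:...)` groups without capturing and a trailing `?` after `+` or `*` makes the operator non-greedy; the test strings are invented.

```python
import re

fish = "guppy guppies"

# With parentheses, the disjunction applies only to the suffix.
grouped = re.findall(r"gupp(?:y|ies)", fish)

# Without them, | splits the whole pattern into "guppy" or "ies".
ungrouped = re.findall(r"guppy|ies", fish)

markup = "<b>bold</b>"

# Greedy: .+ expands as far as it can, up to the last ">".
greedy = re.search(r"<.+>", markup).group()

# Non-greedy: .+? stops at the first ">" that completes a match.
non_greedy = re.search(r"<.+?>", markup).group()
```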

The two main types of errors are __false positives__ and __false negatives__. Reducing the overall error rate thus involves two antagonistic efforts: increasing __precision__ by minimizing false positives and increasing __recall__ by minimizing false negatives.
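To make the trade-off concrete, the sketch below scores two patterns for finding the word _the_ against a small hand-labeled set. The helper `precision_recall` and the labeled examples are hypothetical; precision is computed as TP / (TP + FP) and recall as TP / (TP + FN).

```python
import re

def precision_recall(pattern, labeled):
    """Score a pattern against (text, should_match) pairs (hypothetical helper)."""
    tp = fp = fn = 0
    for text, should_match in labeled:
        predicted = re.search(pattern, text) is not None
        if predicted and should_match:
            tp += 1          # correct hit
        elif predicted and not should_match:
            fp += 1          # false positive
        elif not predicted and should_match:
            fn += 1          # false negative
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Label: does the text contain the word "the" (in any capitalization)?
labeled = [
    ("the cat", True),
    ("The cat", True),
    ("other cats", False),
    ("a theology book", False),
    ("no match here", False),
]

# Naive pattern: misses "The" and wrongly fires inside "other" and "theology".
naive = precision_recall(r"the", labeled)

# Refined pattern: both capitalizations, whole words only.
refined = precision_recall(r"\b[tT]he\b", labeled)
```

On this toy set the naive pattern has both false positives and a false negative, while the refined pattern removes both kinds of error.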

### Substitution, Capture, and Lookahead

__Substitutions__ are commonly written in the form `s/regex/replacement/`. To refer back to part of the match, the __number__ operator `\1` acts as a back-reference, as in `s/([0-9]+)/<\1>/`. This use of parentheses to store a pattern in memory is a __capture group__, where the resulting match is stored in a numbered __register__.
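In Python, the closest equivalent to the `s/.../.../` form is `re.sub`, which accepts `\1` in the replacement string. The sketch below shows the bracketing substitution and a back-reference used inside the pattern itself; the sample sentences are chosen for illustration.

```python
import re

# Equivalent of s/([0-9]+)/<\1>/ applied to every match in the string.
bracketed = re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes and 7 crates")

# A back-reference inside the pattern: \1 must repeat whatever group 1 captured.
echo = re.compile(r"the (.*)er they were, the \1er they will be")

same = echo.search("the bigger they were, the bigger they will be")
different = echo.search("the bigger they were, the faster they will be")
```

Here `same` is a match object whose first register holds `bigg`, while `different` is `None` because the second adjective does not repeat the first.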

The __lookahead__ assertion uses the operator `(?= pattern)`, which is true if the pattern occurs but is _zero-width_, which means the match pointer doesn't advance. The negative lookahead `(?! pattern)` is true only if the pattern does not match. Negative lookahead is commonly used when parsing a complex pattern and trying to rule out some special case. For example, to match, at the beginning of a line, any single word that doesn't start with "Volcano": `/^(?!Volcano)[A-Za-z]+/`
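A brief sketch of both assertions with Python's `re` module, which supports the same `(?=...)` and `(?!...)` syntax. The test strings are invented; the end offset of the positive-lookahead match shows that the looked-ahead text is not consumed.

```python
import re

# Negative lookahead: a line-initial word that does not start with "Volcano".
not_volcano = re.compile(r"^(?!Volcano)[A-Za-z]+")

rejected = not_volcano.match("Volcanoes erupt")
accepted = not_volcano.match("Mountains erode")

# Positive lookahead: match "Volcano" only when "es" follows,
# without including "es" in the match.
plural_stem = re.search(r"Volcano(?=es)", "Volcanoes")
```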

## Words

An __utterance__ is the spoken correlate of a sentence:

> I do uh main- mainly business data processing

This utterance has two kinds of __disfluencies__: the broken-off word _main-_ is a __fragment__, and words like _uh_ and _um_ are __fillers__ / __filled pauses__. Whether disfluencies are kept or discarded depends on the application. A transcription system might strip them out, while a speech recognizer might keep them, since they can signal that the speaker is restarting a clause or idea.
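For the transcription case, stripping disfluencies could look something like the sketch below. It rests on two assumptions that are not from the text: fillers come from a small fixed list (`FILLERS`), and fragments are marked with a trailing hyphen as in the example utterance. Real transcripts may use other conventions.

```python
utterance = "I do uh main- mainly business data processing"

# Assumed filler inventory; extend as needed for a given transcript style.
FILLERS = {"uh", "um", "uhm"}

def strip_disfluencies(text):
    """Drop filler words and hyphen-marked fragments (hypothetical helper)."""
    kept = [
        token
        for token in text.split()
        if token.lower() not in FILLERS and not token.endswith("-")
    ]
    return " ".join(kept)

cleaned = strip_disfluencies(utterance)
```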

A __lemma__ is a set of lexical forms with the same stem, major part-of-speech, and word sense. Inflected forms like _cats_ and _cat_ share _cat_ as their lemma. The __word-form__ is the full inflected or derived form of the word.