<div class="pagebreak"></div>

# Finite Automata and Regular Expressions

In this notebook, we will look at two closely related concepts: finite automata and regular expressions.

Finite automata (FA) are an abstract concept, but they often have real-world counterparts. An FA contains a number of states that a given system may be in. Transitions between the states allow the system to move from one state to another.

One example is the traffic lights at an intersection. Below we have Main Street intersecting with First Avenue. There are two sets of traffic lights, $L_A$ and $L_B$. Behind the scenes, four states exist. $S_0$ allows traffic on First Avenue to have the right of way. If a transition $\overline{T_A}$ occurs (triggered by either a sensor or a timer), then the system moves to state $S_1$, where the light becomes yellow. After a timer expires, the system then moves to $S_2$.

![](images/TrafficLights.png)

Formally, a finite automaton is a 5-tuple $(S,\Sigma,\delta,s_0,A)$ where
- $S$ is a finite set of states. $s$ is a particular state, $s \in S$.
- $\Sigma$ is a finite set of symbols (i.e., the alphabet). $x$ is an input symbol, $x \in \Sigma$.
- $\delta : S \times \Sigma \rightarrow S$ is the transition function. Its inputs are the current state $s$ and the current input symbol $x$; its result is the next state $s_n$.
- $s_0$ is the starting state of the finite automaton, $s_0 \in S$.
- $A$ is the set of accepting states, $A \subseteq S$. If the automaton is in one of these states after reading all of the input, it outputs "yes" and accepts the input. Otherwise, it outputs "no" and rejects the input.
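
The 5-tuple above maps fairly directly onto code. The following is a minimal sketch, not a canonical implementation: the automaton itself (one that accepts binary strings containing an even number of `1`s) and all names such as `STATES`, `DELTA`, and `accepts` are hypothetical and chosen only for illustration.

```python
from typing import Dict, Set, Tuple

# Hypothetical automaton: accepts binary strings with an even number of 1s.
STATES: Set[str] = {"even", "odd"}             # S
ALPHABET: Set[str] = {"0", "1"}                # Sigma
DELTA: Dict[Tuple[str, str], str] = {          # delta: S x Sigma -> S
    ("even", "0"): "even",
    ("even", "1"): "odd",
    ("odd", "0"): "odd",
    ("odd", "1"): "even",
}
START: str = "even"                            # s_0
ACCEPTING: Set[str] = {"even"}                 # A, a subset of S


def accepts(text: str) -> bool:
    """Run the automaton over text; True if it ends in an accepting state."""
    state = START
    for symbol in text:
        if symbol not in ALPHABET:
            raise ValueError(f"symbol {symbol!r} is not in the alphabet")
        state = DELTA[(state, symbol)]         # one transition per input symbol
    return state in ACCEPTING


print(accepts("1001"))  # True: the input contains two 1s
print(accepts("1011"))  # False: the input contains three 1s
```

Storing $\delta$ as a dictionary keyed by `(state, symbol)` keeps the code close to the mathematical definition; other representations, such as nested dictionaries, would work equally well.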

[`wc`](https://web.archive.org/web/20220609090706/https://ss64.com/bash/wc.html) is a Linux/Unix command-line program that counts the characters, words, and newlines in a file.

One way to represent the logic within this program is through a finite automaton. In this scenario, we have two states: the program is currently processing a character either inside a word or outside (not in) a word. Three groups of characters exist: $C_N$, a newline; $C_{notword}$, a character that cannot be part of a word (e.g., a space or tab); and $C_{word}$, a character that can be part of a word.

![](images/wordcount.png)

If the system is in the "not in a word" state and sees
- $C_N$: increment the character and newline counts, stay in the "not in a word" state
- $C_{notword}$: increment the character count, stay in the "not in a word" state
- $C_{word}$: increment the character and word counts, move to the "word" state

If the system is in the "word" state and sees
- $C_N$: increment the character and newline counts, move to the "not in a word" state
- $C_{notword}$: increment the character count, move to the "not in a word" state
- $C_{word}$: increment the character count, stay in the "word" state
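
The two transition lists above can be sketched in Python as follows. This is an illustration of the automaton only, not how `wc` is actually implemented. The function name `count_text` is hypothetical, and the code assumes that $C_{notword}$ means any whitespace character other than a newline.

```python
from typing import Tuple


def count_text(text: str) -> Tuple[int, int, int]:
    """Return (characters, words, newlines) using a two-state automaton."""
    NOT_IN_WORD, IN_WORD = 0, 1
    state = NOT_IN_WORD
    chars = words = newlines = 0
    for ch in text:
        chars += 1                      # every transition counts the character
        if ch == "\n":                  # C_N
            newlines += 1
            state = NOT_IN_WORD
        elif ch.isspace():              # C_notword (assumption: other whitespace)
            state = NOT_IN_WORD
        else:                           # C_word
            if state == NOT_IN_WORD:
                words += 1              # entering a word starts a new word
            state = IN_WORD
    return chars, words, newlines


print(count_text("hello world\nbye\n"))  # (16, 3, 2)
```

Note that the word count is incremented only on the transition from "not in a word" to "word", which is why a run of several word characters is counted once.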

Other real world examples include vending machines (tracking coins received), elevators (current location, requested floors), and combination locks.

We can also use FA to model web applications. We can consider the current page of a web application to be a "state". Transitions, then, are the movements from one page to another. As with the `wc` example, we can augment the transitions to perform different activities.

Based upon the theoretical concepts of finite automata, we have [regular expressions](https://en.wikipedia.org/wiki/Regular_expression).  Through a powerful notational grammar, regular expressions can find simple and complex text patterns. We'll also see at the end of this notebook how we can go between finite automata and regular expressions.

In the Strings notebook (TODO add reference), we learned how to perform basic searches for text within strings.

For example:

In [7]:
line = "can we search for the specified value in the string?"
print("value" in line)
print(line.find("value"))

True
32


Quite often, though, we will want to search for patterns of text rather than specific text strings. These patterns include things like phone numbers, email addresses, URLs, and zip codes. Suppose you were told to find all email addresses on a large number of web pages. How could you perform the search?

Regular expressions are sequences of character patterns matched against a string.  We can use regular expressions to test for the presence of a pattern, extract the matched text, or replace text with another pattern.
Practically, we use regular expressions for several tasks:
- Extracting data from unstructured text
- Validating web forms and other sources of input data before processing
- Searching for protected information, such as Social Security numbers, in documents
- Cleaning data
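
As a rough sketch of the three uses described above (testing for a pattern, extracting matches, and replacing text), the following uses Python's standard `re` module. The sample text and the deliberately simple phone-number pattern are made up for illustration; real phone numbers come in many more formats.

```python
import re

# Hypothetical sample text and a deliberately simple phone-number pattern.
text = "Call 555-123-4567 or 555-987-6543 for details."
pattern = r"\d{3}-\d{3}-\d{4}"

# 1. Test for the presence of the pattern.
has_phone = re.search(pattern, text) is not None

# 2. Extract every piece of matched text.
phones = re.findall(pattern, text)

# 3. Replace each match with other text.
redacted = re.sub(pattern, "XXX-XXX-XXXX", text)

print(has_phone)  # True
print(phones)     # ['555-123-4567', '555-987-6543']
print(redacted)   # Call XXX-XXX-XXXX or XXX-XXX-XXXX for details.
```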

Commonly, we'll refer to regular expressions as "regexes".  

To use regular expressions with Python, import the `re` module:

In [1]:
import re

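In [None]:
# Illustrative example: re.search returns a match object for the first match, or None
match = re.search(r"\d+", "Room 101 is on floor 3")
print(match.group())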

Regular expression matching follows a basic approach:
- The search proceeds through the string from start to end, stopping at the first match.
- At each position, it tries to match the pattern, consuming characters from the string as well as items from the regex pattern.
- If the pattern is completely consumed (everything has matched), the overall match has occurred and we can retrieve the matched string.
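
To make these steps concrete, here is a toy matcher written by hand. It is only a sketch of the idea and not how the `re` module is implemented. The function `naive_search` is hypothetical and supports just literal characters plus `.`, which it treats as matching any single character (unlike `re`, where `.` does not match a newline by default).

```python
import re
from typing import Optional


def naive_search(pattern: str, text: str) -> Optional[str]:
    """Return the first substring of text that matches pattern, or None.

    Simplifying assumption: pattern holds only literal characters and '.',
    which stands for any single character.
    """
    for start in range(len(text) - len(pattern) + 1):  # scan left to right
        pos = start
        for item in pattern:                            # consume pattern items
            if item != "." and text[pos] != item:
                break                                   # mismatch: try next start
            pos += 1                                    # consume one character
        else:
            return text[start:pos]                      # whole pattern consumed
    return None


line = "the cat sat on the mat"
print(naive_search("s.t", line))          # sat
print(re.search(r"s.t", line).group())    # sat
```

For this input the toy matcher and `re.search` agree, because both return the leftmost position at which the whole pattern can be consumed.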

## Basic Functions

### search

### findall

### split

## Metacharacters, Character Classes, and Quantifiers

### Metacharacters

### Predefined character classes

### Custom character classes

### Quantifiers
- `*`
- `+`
- `?`
- `{ }`
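
A small sketch of how the quantifiers listed above behave, using `re.fullmatch` from Python's standard `re` module, which succeeds only when the entire string matches. The patterns and sample strings are made up for illustration.

```python
import re

# Each quantifier applies to the item immediately before it (here, the letter b).
star = re.fullmatch(r"ab*c", "ac") is not None           # * : zero or more b's
plus = re.fullmatch(r"ab+c", "ac") is not None           # + : one or more b's
optional = re.fullmatch(r"ab?c", "abc") is not None      # ? : zero or one b
counted = re.fullmatch(r"ab{2,3}c", "abbc") is not None  # {m,n} : m to n b's

print(star)      # True
print(plus)      # False
print(optional)  # True
print(counted)   # True
```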

## Flags

## Replacing substrings

## Capturing substrings in a match

## Resources

Python Docs: https://docs.python.org/3/library/re.html   https://docs.python.org/3/howto/regex.html


https://regex101.com/

https://www.regextester.com/

https://www.regular-expressions.info/

https://www.freeformatter.com/regex-tester.html


good overview: https://geekflare.com/regular-expression-tester/

short intro: https://learning.oreilly.com/library/view/an-introduction-to/9781492082569/

Regular Expressions Cookbook: https://learning.oreilly.com/library/view/regular-expressions-cookbook/9781449327453/ch01.html

Introducing Regular Expressions: https://learning.oreilly.com/library/view/introducing-regular-expressions/9781449338879/index.html

## Expressions