# Regular Expressions

## Learning Objectives

After completing this chapter, you will be able to:

* Understand how regular expressions are constructed.
* Define a language using a regular expression.
* Identify the distinction between a regular expression and the language it generates.

## 1. First Examples of Regular Expressions

In the last chapter we learned about the Kleene star. When applied to an alphabet $\Sigma$ it produced every possible concatenation of letters from that alphabet (including the empty string). So, for example, if $\Sigma = \{a, b\}$ then

$$\displaystyle \Sigma^{*} = \{\lambda, a, b, aa, ab, ba, bb, aaa, \ldots\}$$

Similarly, if $S$ is a collection of words then $S^{*}$ is all possible concatenations of words from $S$, including the empty word. For example, if $S = \{foo, bar\}$ then

$$\displaystyle S^{*} = \{\lambda, foo, bar, foofoo, foobar, barfoo, barbar, \ldots\}$$
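Since $S^{*}$ is infinite, a program can only list part of it. The following is a minimal sketch that lists the words built from at most a fixed number of words of $S$; the function name `star_up_to` and the cutoff parameter `max_factors` are our own choices for illustration, not standard notation.

```python
from itertools import product

def star_up_to(words, max_factors):
    """Return the words of S* built from at most max_factors words of S."""
    result = set()
    for k in range(max_factors + 1):
        # product(words, repeat=k) yields every sequence of k words from S.
        # For k == 0 it yields one empty sequence, which joins to the empty word.
        for combo in product(words, repeat=k):
            result.add("".join(combo))
    return result

S = ["foo", "bar"]
# Sort by length, then alphabetically, so the listing is easy to read.
print(sorted(star_up_to(S, 2), key=lambda w: (len(w), w)))
# ['', 'bar', 'foo', 'barbar', 'barfoo', 'foobar', 'foofoo']
```

The empty string `''` in the output plays the role of $\lambda$.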


Our first example of a regular expression begins with this idea, but instead of applying the Kleene star to an alphabet or a set of words, we apply it to a single letter. So, for example, $\textbf{x}^{*}$ can stand for any of the words $\lambda, x, xx, \ldots$.

*Note* - We will frequently write a repeated letter as a power, so $xx = x^{2}$, $xxx = x^{3}$, and so on. You can think of the $^{*}$ operator as an undetermined power.

We could use this to define a language. For example, we could define the language $L_{1}$ by $L_{1} = language(\textbf{x}^{*})$, meaning $L_{1}$ is the language that consists of all possible words generated by the rule $\textbf{x}^{*}$:

$$\displaystyle L_{1} = \{\lambda, x, x^{2}, \ldots\}$$


## 2. Combining Regular Expressions

We don't need to limit ourselves to just one letter. For example, the regular expression $\textbf{a}^{*}\textbf{b}^{*}$ generates words that begin with an arbitrary finite number of $a$'s, followed by an arbitrary finite number of $b$'s. The regular expression $(\textbf{a}\textbf{b})^{*}$ (note the parentheses) generates words that are an arbitrary finite sequence of pairs $ab$. So, if $L_{2}$ is $language((\textbf{a}\textbf{b})^{*})$, then

$$\displaystyle L_{2} = \{\lambda, ab, abab, ababab, abababab, \ldots\}$$

Now, suppose instead we wanted to define a language over the alphabet $\Sigma = \{a,b\}$ where a word in the language starts with $a$, and then has any arbitrary number of $b$'s. Well, we could expand our concept of a regular expression a bit and generate this language with $\textbf{a}\textbf{b}^{*}$. This means the letter $a$, followed by any finite number (including zero) of the letter $b$. If $L_{3} = language(\textbf{a}\textbf{b}^{*})$, then

$$\displaystyle L_{3} = \{a, ab, abb, abbb, \ldots\}$$

Note the distinction between $\textbf{a}\textbf{b}^{*}$ and $(\textbf{a}\textbf{b})^{*}$. The parentheses are important!

Finally, suppose we want to define a language $L_{4}$, where each word begins with *either* the letter $a$ or the letter $c$, followed by an arbitrary finite number of $b$'s. Well, we could generate this language with the regular expression $(\textbf{a} + \textbf{c})\textbf{b}^{*}$. Here $(\textbf{a} + \textbf{c})$ means either $a$ or $c$. So,

$$L_{4} = language((\textbf{a} + \textbf{c})\textbf{b}^{*}) = \{a, c, ab, cb, abb, cbb, \ldots\}$$
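To make the three languages of this section concrete, here is a small sketch that lists their first few words using ordinary Python string repetition. The function names and the cutoff `max_n` are hypothetical, chosen only for this illustration.

```python
def words_of_ab_grouped_star(max_n):
    """Words of language((ab)*): the pair 'ab' repeated n times, n = 0..max_n."""
    return ["ab" * n for n in range(max_n + 1)]

def words_of_a_then_b_star(max_n):
    """Words of language(ab*): one 'a' followed by n copies of 'b', n = 0..max_n."""
    return ["a" + "b" * n for n in range(max_n + 1)]

def words_of_a_or_c_then_b_star(max_n):
    """Words of language((a + c)b*): 'a' or 'c', then n copies of 'b'."""
    return [first + "b" * n for n in range(max_n + 1) for first in ("a", "c")]

print(words_of_ab_grouped_star(2))     # ['', 'ab', 'abab']
print(words_of_a_then_b_star(2))       # ['a', 'ab', 'abb']
print(words_of_a_or_c_then_b_star(2))  # ['a', 'c', 'ab', 'cb', 'abb', 'cbb']
```

Comparing the first two lines of output shows the effect of the parentheses: only $(\textbf{a}\textbf{b})^{*}$ generates the empty word, and only $\textbf{a}\textbf{b}^{*}$ generates $abb$.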


## 3. Defining Regular Expressions

**Definition**
Formally, the set of *regular expressions* over an alphabet $\Sigma$ is defined by the rules:

* **Rule 1**: Every letter of $\Sigma$ can be made into a regular expression by writing it in boldface; $\boldsymbol{\lambda}$ (the empty string) is a regular expression.

* **Rule 2**: If $\textbf{r}_{1}$ and $\textbf{r}_{2}$ are regular expressions, then so are:

  * $(\textbf{r}_{1})$
  * $\textbf{r}_{1}$ $\textbf{r}_{2}$
  * $\textbf{r}_{1}$ + $\textbf{r}_{2}$
  * $\textbf{r}_{1}^{*}$

* **Rule 3**: Nothing else is a regular expression.
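One way to see that this definition is recursive is to mirror it with a small data structure. The sketch below is only one possible encoding: the class names `Letter`, `Empty`, `Concat`, `Plus`, and `Star` and the helper `show` are our own assumptions, not part of the formal definition. Parentheses are not given a class of their own, because the nesting of the objects already records the grouping.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Letter:
    symbol: str      # Rule 1: a single letter of the alphabet

@dataclass(frozen=True)
class Empty:
    pass             # Rule 1: the expression lambda

@dataclass(frozen=True)
class Concat:
    left: object     # Rule 2: r1 r2
    right: object

@dataclass(frozen=True)
class Plus:
    left: object     # Rule 2: r1 + r2
    right: object

@dataclass(frozen=True)
class Star:
    inner: object    # Rule 2: r1*

def show(r):
    """Render an expression in the chapter's notation, fully parenthesized."""
    if isinstance(r, Letter):
        return r.symbol
    if isinstance(r, Empty):
        return "λ"
    if isinstance(r, Concat):
        return "(" + show(r.left) + show(r.right) + ")"
    if isinstance(r, Plus):
        return "(" + show(r.left) + " + " + show(r.right) + ")"
    if isinstance(r, Star):
        return "(" + show(r.inner) + ")*"
    # Rule 3: anything else is not a regular expression.
    raise TypeError("not a regular expression")

# The expression (a + c)b* from section 2, built from the rules.
expr = Concat(Plus(Letter("a"), Letter("c")), Star(Letter("b")))
print(show(expr))  # ((a + c)(b)*)
```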

## 4. Examples of Regular Expressions

**Example**: In the alphabet $\Sigma = \{a,b\}$, the language of all words that contain the letter $a$ can be generated by the regular expression:

$$\textbf{(a+b)}^{*}\textbf{a}\textbf{(a+b)}^{*}$$

In the example above, the regular expression is an arbitrary sequence of $a$'s and $b$'s, then the letter $a$, then another arbitrary sequence of $a$'s and $b$'s. Note the arbitrary sequences could be empty. This can generate any word that contains the letter $a$ at least once.

Now, the only words over $\Sigma$ this regular expression could not generate are the words that don't contain the letter $a$ anywhere. These words can be generated with the regular expression $\textbf{b}^{*}$. So, *any* word over the alphabet $\Sigma$ can be generated with the regular expression:

$$\textbf{(a+b)}^{*}\textbf{a}\textbf{(a+b)}^{*} + \textbf{b}^{*}$$

**Example**: The regular expression $(\textbf{a} + \textbf{b})^{*}$ generates all words in the alphabet $\Sigma = \{a,b\}$.

We see here an important distinction between a regular expression and the language it generates. You can have two distinct regular expressions that generate the same language (set of words). For example,

$$language(\textbf{(a+b)}^{*}\textbf{a}\textbf{(a+b)}^{*} + \textbf{b}^{*}) = language((\textbf{a}+\textbf{b})^{*})$$

We say that regular expressions that generate the same language are *equivalent*.
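A quick way to gain confidence that two expressions are equivalent, or to find a word that tells them apart, is to compare them on every word up to some length. The sketch below does this with Python's *re* library, which section 6 introduces; the function name `agree_up_to` is hypothetical, and the pattern "[ab]*" is the Python form of $(\textbf{a}+\textbf{b})^{*}$. Note that such a bounded check can only refute equivalence; agreement up to a given length is evidence, not a proof.

```python
import re
from itertools import product

def agree_up_to(pattern_1, pattern_2, alphabet, max_length):
    """Return True if both patterns accept exactly the same words of length <= max_length."""
    for n in range(max_length + 1):
        # Enumerate every word of length n over the alphabet.
        for letters in product(alphabet, repeat=n):
            word = "".join(letters)
            in_first = re.fullmatch(pattern_1, word) is not None
            in_second = re.fullmatch(pattern_2, word) is not None
            if in_first != in_second:
                return False  # this word belongs to exactly one of the two languages
    return True

# (a*b*)* and (a+b)* agree on every word of length at most 5.
print(agree_up_to("(a*b*)*", "[ab]*", "ab", 5))  # True
# (ab)* and ab* already differ on the empty word.
print(agree_up_to("(ab)*", "ab*", "ab", 5))      # False
```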

## 5. Defining Regular Languages

We'll now formally define a language generated by a regular expression, but first we want to define the notion of the product (or concatenation) of two sets of words, the sum (or union) of two sets of words, and the Kleene closure (or star) of a set of words.

**Definitions**

If $S$ and $T$ are two sets of words, then their product $ST$ is all words that are concatenations of a word from the first and a word from the second. So, for example, if $S$ is words in the English language, and $T$ is words in the Spanish language, then their product would include such words as "machinetonto" and "computerbueno".

If $S$ and $T$ are two sets of words, then their sum $S+T$ is all words that are either from $S$ or $T$. This is also known as the *union* of the sets $S$ and $T$. So, for example, if $S$ and $T$ are again English and Spanish words then $S+T$ would contain any word that is either English or Spanish, so it would contain, for example, both "machine" and "bueno".

If $S$ is a set of words, then the Kleene closure $S^{*}$ is the set of all concatenations of an arbitrary finite sequence of words from $S$. This includes the empty word $\lambda$, which is a concatenation of no words from $S$. So, for example, if $S$ is words in the English language, then $S^{*}$ would contain "computer", "computercomputer", "computersciencestudent", and so on.
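These three operations map naturally onto Python sets of strings. The sketch below uses tiny stand-in sets rather than whole languages, and the names `set_product`, `set_sum`, and `closure_up_to` are our own. Because a Kleene closure is infinite whenever $S$ contains a nonempty word, the closure is cut off after a chosen number of concatenated words.

```python
def set_product(S, T):
    """Product ST: every word of S followed by every word of T."""
    return {s + t for s in S for t in T}

def set_sum(S, T):
    """Sum S + T: the union of the two sets."""
    return S | T

def closure_up_to(S, max_factors):
    """The part of S* made from at most max_factors words of S."""
    result = {""}   # the empty word: a concatenation of no words
    layer = {""}
    for _ in range(max_factors):
        layer = set_product(layer, S)  # words made from one more factor
        result |= layer
    return result

S = {"machine", "computer"}
T = {"tonto", "bueno"}
print(sorted(set_product(S, T)))
# ['computerbueno', 'computertonto', 'machinebueno', 'machinetonto']
print(sorted(set_sum(S, T)))
# ['bueno', 'computer', 'machine', 'tonto']
print(len(closure_up_to(S, 2)))  # 7
```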

Using these definitions, we can formally define the language generated by a given regular expression.

**Definition**

The following rules define the *language associated with a regular expression*:

* **Rule 1**: The language associated with the regular expression that is just a single letter is that one-letter word alone. The language associated with $\boldsymbol{\lambda}$ is just $\{\lambda\}$, the language consisting of just the empty string.
        
* **Rule 2**: If $L_{1}$ is the language associated with the regular expression $\textbf{r}_{1}$, and $L_{2}$ is the language associated with the regular expression $\textbf{r}_{2}$, then:
  * The product $L_{1}L_{2}$ (the language $L_{1}$ times $L_{2}$) is the language associated with the regular expression $\textbf{r}_{1}\textbf{r}_{2}$:

$$language(\textbf{r}_{1}\textbf{r}_{2}) = L_{1}L_{2}$$
    
  * The language formed by the union of the sets $L_{1}$ and $L_{2}$ is associated with the regular expression $\textbf{r}_{1} + \textbf{r}_{2}$:

$$language(\textbf{r}_{1} + \textbf{r}_{2}) = L_{1} + L_{2}$$

  * $L_{1}^{*}$, the Kleene closure of the set of words $L_{1}$, is the language associated with the regular expression $\textbf{r}_{1}^{*}$:

$$language(\textbf{r}_{1}^{*}) = L_{1}^{*}$$
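The rules above translate almost line for line into a recursive function. The following sketch assumes a hypothetical encoding of regular expressions as nested tuples (described in the docstring) and only returns words up to a chosen length, since the full language may be infinite.

```python
def language(expr, max_length):
    """Return the words of language(expr) that have at most max_length letters.

    expr is a nested tuple (an encoding chosen for this sketch):
      ("letter", "a"), ("lambda",), ("concat", r1, r2), ("plus", r1, r2), ("star", r1)
    """
    kind = expr[0]
    if kind == "letter":    # Rule 1: the one-letter word alone
        return {expr[1]} if max_length >= 1 else set()
    if kind == "lambda":    # Rule 1: just the empty word
        return {""}
    if kind == "concat":    # Rule 2: the product L1 L2
        left = language(expr[1], max_length)
        right = language(expr[2], max_length)
        return {u + v for u in left for v in right if len(u + v) <= max_length}
    if kind == "plus":      # Rule 2: the union L1 + L2
        return language(expr[1], max_length) | language(expr[2], max_length)
    if kind == "star":      # Rule 2: the Kleene closure of L1
        base = language(expr[1], max_length)
        result = {""}
        frontier = {""}
        while frontier:
            # Extend the newest words by one more word of L1, keeping only
            # words that are new and still within the length bound.
            frontier = {u + v for u in frontier for v in base
                        if len(u + v) <= max_length} - result
            result |= frontier
        return result
    raise ValueError("unknown expression kind: " + repr(kind))

# (a + c)b*, the expression for L4 in section 2.
expr = ("concat", ("plus", ("letter", "a"), ("letter", "c")), ("star", ("letter", "b")))
print(sorted(language(expr, 3), key=lambda w: (len(w), w)))
# ['a', 'c', 'ab', 'cb', 'abb', 'cbb']
```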

Now that we've got the formal definition, there are some important questions that we can ask right away, including:

* Is there a formal, algorithmic way of telling whether the same language is associated with two different regular expressions? That is, can we tell if the two regular expressions are equivalent? It turns out the answer is "yes", and we'll learn how to do this later.

* Is every possible language associated with a regular expression? It turns out the answer is "no", and we'll see some counterexamples later.


**Example**: A language is called *regular* if it is the language associated with some regular expression. All finite languages are regular: if a language is finite, we can just list its words - $w_{1}, w_{2}, \ldots, w_{N}$ - and represent it with the regular expression $(\textbf{w}_{1} + \textbf{w}_{2} + \cdots + \textbf{w}_{N})$.
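The same construction can be sketched in Python, using the *re* library that section 6 introduces. In Python patterns the "|" operator plays the role of $+$. The helper name `pattern_for_finite_language` is our own, and the sketch assumes the list of words is not empty.

```python
import re

def pattern_for_finite_language(words):
    """Build a pattern w1|w2|...|wN from a non-empty list of words."""
    # re.escape protects any characters that have a special meaning in
    # patterns, so each word is matched literally.
    return "|".join(re.escape(w) for w in words)

pattern = pattern_for_finite_language(["foo", "bar", "foobar"])
print(pattern)                                      # foo|bar|foobar
print(re.fullmatch(pattern, "bar") is not None)     # True
print(re.fullmatch(pattern, "foobar") is not None)  # True
print(re.fullmatch(pattern, "barfoo") is not None)  # False
```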

## 6. Implementing Basic Regular Expressions in Python

Python has a regular expression library called *re* that can be used for matching strings to regular expressions. Let's import it and play around with it a bit.

In [None]:
import re

There is a function called *match* in the *re* library. However, that function only checks whether the start of a string matches a regular expression. To determine whether an entire string matches a regular expression, we need to use the *fullmatch* function.
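The difference is easy to see on a string that starts with the pattern but continues past it. The strings here are only illustrative:

```python
import re

# match succeeds because "abc" starts with "ab".
print(re.match("ab", "abc") is not None)      # True
# fullmatch fails because the trailing "c" is not covered by the pattern.
print(re.fullmatch("ab", "abc") is not None)  # False
# fullmatch succeeds when the whole string is covered.
print(re.fullmatch("ab", "ab") is not None)   # True
```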

What the *fullmatch* function returns is a bit more complicated than what we need here (see the documentation linked at the bottom of this chapter), so we'll write a short helper function that will return either **True** or **False** depending on whether a string matches a given regular expression pattern.

In [None]:
def is_full_match(pattern, string):
    return re.fullmatch(pattern, string) is not None

This function determines whether the string given in the second argument matches the regular expression (pattern) given in the first argument. We'll now discuss how we can specify this pattern.

The simplest type of match is just a direct string match. In this case, the pattern is just a string. For example, the identical strings below will return **True**:

In [None]:
is_full_match("Weber", "Weber")

The different strings below will return **False**:

In [None]:
is_full_match("Weber", "State")

Note that matching here is case sensitive:

In [None]:
is_full_match("Weber","weber")

We can extend this to matching multiple possible characters at a given position. So, for example, the pattern *\[ac\]b* will match a string consisting of either "a" *or* "c", followed by "b". This is equivalent to the regular expression $(\textbf{a}+\textbf{c})\textbf{b}$ expressed in the notation from earlier in this chapter.

For example, both of these will return **True**:

In [None]:
is_full_match("[ac]b", "ab")

In [None]:
is_full_match("[ac]b", "cb")

However, this pattern won't match with "bb", and will return **False**:

In [None]:
is_full_match("[ac]b", "bb")

We can also use the $*$ operator in a pattern the same way as in our regular expression notation. So, for example, the regular expression $\textbf{a}\textbf{b}^{*}$ would correspond with the pattern "ab*":

In [None]:
is_full_match("ab*", "ab")

In [None]:
is_full_match("ab*", "abb")

In [None]:
is_full_match("ab*", "a")

In [None]:
is_full_match("ab*", "bb")

The regular expression $(\textbf{a}\textbf{b})^{*}$ would correspond with the pattern "(ab)*":

In [None]:
is_full_match("(ab)*","ab")

In [None]:
is_full_match("(ab)*","abab")

In [None]:
is_full_match("(ab)*","")

In [None]:
is_full_match("(ab)*","abb")

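Using these, together with the "|" operator, which plays the role of $+$ between arbitrary sub-patterns (the bracket notation above only covers a choice between single characters), we can create a Python regular expression pattern corresponding with any regular expression specified according to the definition in section 3 above.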
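As a short illustration of the "|" operator in Python patterns, the sketch below translates a few expressions from the chapter's notation. It repeats the *is_full_match* helper so that it can be run on its own; the example strings are our own.

```python
import re

def is_full_match(pattern, string):
    return re.fullmatch(pattern, string) is not None

# (a + c)b*  corresponds to  "(a|c)b*"
print(is_full_match("(a|c)b*", "cbb"))   # True
print(is_full_match("(a|c)b*", "acb"))   # False

# "|" also works between whole words: foo + bar corresponds to "foo|bar"
print(is_full_match("foo|bar", "bar"))   # True

# (foo + bar)* needs parentheses, just as in the chapter's notation.
print(is_full_match("(foo|bar)*", "foobarfoo"))  # True

# An empty alternative stands for lambda: (lambda + a)b corresponds to "(|a)b"
print(is_full_match("(|a)b", "b"))       # True
print(is_full_match("(|a)b", "aab"))     # False
```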

Please note we've just scratched the surface of the *re* library and how it can be used in Python. There's much more there, and the interested reader is encouraged to consult the online documentation referenced in the "Further Reading" section (8) below.

## 7. Practice Exercises

**Exercise 1** - Provide a regular expression in the alphabet $\Sigma = \{a,b\}$ that can generate all strings in which the letter $b$ is *never* tripled. This means no word contains the substring $bbb$.

**Exercise 2** - Show that the regular expressions $(\textbf{a}^{*}\textbf{b})^{*}\textbf{a}^{*}$ and $\textbf{a}^{*}(\textbf{b}\textbf{a}^{*})^{*}$ generate the same language.

**Exercise 3** - Write a Python regular expression pattern corresponding with the regular expression $\textbf{(a+b)}^{*}\textbf{a}\textbf{(a+b)}^{*} + \textbf{b}^{*}$, and use the *is_full_match* function above to verify it matches the strings "bab", "abbabba", and "bbbbb".

## 8. Further Reading

* [Introduction to Computer Theory](https://www.amazon.com/Introduction-Computer-Theory-Daniel-Cohen/dp/0471137723) (Second Edition) by Daniel I.A. Cohen

  *Chapter 4 - Regular Expressions*

* [Automata Theory, Languages, and Computation](https://www.amazon.com/Introduction-Automata-Theory-Languages-Computation/dp/0321462254) (Third Edition) by Hopcroft, Motwani, and Ullman

  *Section 3.1 - Regular Expressions*

* [Introduction to the Theory of Computation](https://www.cengage.com/c/introduction-to-the-theory-of-computation-3e-sipser/9781133187790/) (Third Edition) by Michael Sipser
  
  *Section 1.3 - Regular Expressions*

* The Python [re library](https://docs.python.org/3/library/re.html)
* Python regular expressions [HOWTO](https://docs.python.org/3/howto/regex.html)