# Regular Expressions

## Learning Objectives
After completing this chapter, you will be able to:

* Understand how regular expressions are constructed.
* Define a language using a regular expression.
* Identify the distinction between a regular expression and the language it generates.

## 1. First Examples of Regular Expressions

In our last lecture we learned about the Kleene star. When applied to an alphabet $\Sigma$ it produced every possible concatenation of letters from that alphabet (including the empty string). So, for example, if $\Sigma = \{a, b\}$ then

<br>
<center>
    $\displaystyle \Sigma^{*} = \{\lambda, a, b, aa, ab, ba, bb, aaa, \ldots\}$.
</center>
<br>

Similarly, if $S$ is a collection of words then $S^{*}$ is all possible concatenations of words from $S$, including the empty word. For example, if $S = \{foo, bar\}$ then

<center>
    $\displaystyle S^{*} = \{\lambda, foo, bar, foofoo, foobar, barfoo, barbar, \ldots\}$.
</center>
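The closure $S^{*}$ is infinite, so it can only be listed up to some cutoff. The following is a minimal sketch in Python, where the function name `kleene_star` and the cutoff parameter `max_factors` are assumptions introduced here for illustration. It lists every concatenation of at most `max_factors` words from $S$, with the empty string `""` playing the role of $\lambda$.

```python
from itertools import product

def kleene_star(words, max_factors):
    """Return every concatenation of at most `max_factors` words from `words`.

    The true Kleene closure is infinite, so this sketch truncates it.
    """
    result = set()
    for k in range(max_factors + 1):
        for combo in product(words, repeat=k):
            result.add("".join(combo))  # k = 0 contributes the empty word
    return result

S = {"foo", "bar"}
# Sort by length first so the listing resembles the one in the text.
print(sorted(kleene_star(S, 2), key=lambda w: (len(w), w)))
# ['', 'bar', 'foo', 'barbar', 'barfoo', 'foobar', 'foofoo']
```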

A regular expression begins with this idea, but instead of applying it to an alphabet or a set of words, it applies the Kleene star to a single letter. So, for example, $\textbf{x}^{*}$ can generate any of $\lambda, x, xx, \ldots$. *Note*: We will frequently write a repeated letter as a power, so $xx = x^{2}$, $xxx = x^{3}$, and so on. You can think of the $^{*}$ operator as an undetermined power.

We could use this to define a language. For example, we could define the language $L_{1}$ by $L_{1} = language(\textbf{x}^{*})$, meaning $L_{1}$ is the language that consists of all possible words generated by the rule $\textbf{x}^{*}$:

<center>
    $\displaystyle L_{1} = \{\lambda, x, x^{2}, \ldots\}$.
</center>

## 2. Combining Regular Expressions

We can introduce more than one letter. For example, the regular expression $\textbf{a}^{*}\textbf{b}^{*}$ generates words that begin with an arbitrary finite number of $a$'s, followed by an arbitrary finite number of $b$'s. The regular expression $(\textbf{a}\textbf{b})^{*}$ (note the parentheses) generates words that are an arbitrary finite sequence of pairs $ab$. So, if $L_{2}$ is $language((\textbf{a}\textbf{b})^{*})$, then

<center>
    $\displaystyle L_{2} = \{\lambda, ab, abab, ababab, abababab, \ldots\}$.
</center>

Now, suppose instead we wanted to define a language over the alphabet $\Sigma = \{a,b\}$ where a word in the language starts with $a$, and then has an arbitrary number of $b$'s. We can generate this language with $\textbf{a}\textbf{b}^{*}$. This means the letter $a$, followed by any finite number (including zero) of the letter $b$. If $L_{3} = language(\textbf{a}\textbf{b}^{*})$, then

<center>
    $\displaystyle L_{3} = \{a, ab, abb, abbb, \ldots\}$.
</center>

Note the distinction between $\textbf{a}\textbf{b}^{*}$ and $(\textbf{a}\textbf{b})^{*}$. The parentheses are important!

Finally, suppose we want to define a language $L_{4}$, where each word begins with *either* the letter $a$ or the letter $c$, followed by an arbitrary finite number of $b$'s. Well, we could generate this language with the regular expression $(\textbf{a} + \textbf{c})\textbf{b}^{*}$. Here $(\textbf{a} + \textbf{c})$ means either $a$ or $c$. So,

<center>
    $L_{4} = language((\textbf{a} + \textbf{c})\textbf{b}^{*}) = \{a, c, ab, cb, abb, cbb, \ldots\}$.
</center>
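These small languages can be explored with Python's `re` module, whose syntax is close to the notation used here. The main difference is that Python writes the choice operator $+$ as `|`. The sketch below is illustrative only: the helper names `words_up_to` and `bounded_language` are assumptions introduced here, and the listing is cut off at a maximum word length because the languages are infinite.

```python
import re
from itertools import product

def words_up_to(alphabet, max_len):
    """Yield every word over `alphabet` of length 0..max_len, shortest first."""
    for n in range(max_len + 1):
        for letters in product(sorted(alphabet), repeat=n):
            yield "".join(letters)

def bounded_language(pattern, alphabet, max_len):
    """List the words of length <= max_len that the whole pattern matches."""
    compiled = re.compile(pattern)
    return [w for w in words_up_to(alphabet, max_len) if compiled.fullmatch(w)]

print(bounded_language(r"a*b*", "ab", 2))      # ['', 'a', 'b', 'aa', 'ab', 'bb']
print(bounded_language(r"(ab)*", "ab", 4))     # ['', 'ab', 'abab']
print(bounded_language(r"ab*", "ab", 3))       # ['a', 'ab', 'abb']
print(bounded_language(r"(a|c)b*", "abc", 3))  # ['a', 'c', 'ab', 'cb', 'abb', 'cbb']
```

Using `fullmatch` rather than `search` matters: a word belongs to the language only if the expression generates the entire word, not just a piece of it.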

## 3. Defining Regular Expressions

---

**Definition**
Formally, the set of *regular expressions* over an alphabet $\Sigma$ is defined by the rules:

* **Rule 1**: Every letter of $\Sigma$ can be made into a regular expression by writing it in boldface; $\boldsymbol{\lambda}$ itself is a regular expression.

* **Rule 2**: If $\textbf{r}_{1}$ and $\textbf{r}_{2}$ are regular expressions, then so are:

  * $(\textbf{r}_{1})$
  * $\textbf{r}_{1}$ $\textbf{r}_{2}$
  * $\textbf{r}_{1}$ + $\textbf{r}_{2}$
  * $\textbf{r}_{1}^{*}$

* **Rule 3**: Nothing else is a regular expression.

---
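One way to make the recursive definition concrete is to mirror each rule with a data type. The sketch below is an assumption-laden illustration, not a standard library: the class names `Letter`, `Lambda`, `Concat`, `Plus`, and `Star` and the function `show` are all hypothetical names chosen here. Parentheses are not a separate node because the tree structure already records the grouping.

```python
from dataclasses import dataclass

class Regex:
    """Base class for the hypothetical syntax-tree nodes below."""

@dataclass(frozen=True)
class Letter(Regex):   # Rule 1: a single letter of the alphabet
    symbol: str

@dataclass(frozen=True)
class Lambda(Regex):   # Rule 1: the expression for the empty word
    pass

@dataclass(frozen=True)
class Concat(Regex):   # Rule 2: r1 r2
    left: Regex
    right: Regex

@dataclass(frozen=True)
class Plus(Regex):     # Rule 2: r1 + r2
    left: Regex
    right: Regex

@dataclass(frozen=True)
class Star(Regex):     # Rule 2: r1*
    inner: Regex

def show(r):
    """Render a tree in the chapter's notation, with explicit parentheses."""
    if isinstance(r, Letter):
        return r.symbol
    if isinstance(r, Lambda):
        return "λ"
    if isinstance(r, Concat):
        return show(r.left) + show(r.right)
    if isinstance(r, Plus):
        return "(" + show(r.left) + " + " + show(r.right) + ")"
    if isinstance(r, Star):
        return "(" + show(r.inner) + ")*"
    raise TypeError("not a regular expression")  # Rule 3: nothing else qualifies

# The expression (a + c) b* from the previous section.
expr = Concat(Plus(Letter("a"), Letter("c")), Star(Letter("b")))
print(show(expr))  # (a + c)(b)*
```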

## 4. Examples of Regular Expressions

**Example**: In the alphabet $\Sigma = \{a,b\}$, the language of all words that contain the letter $a$ can be generated by the regular expression:

<center>
    $\textbf{(a+b)}^{*}\textbf{a}\textbf{(a+b)}^{*}$
</center>

In the example above, the regular expression is an arbitrary sequence of $a$'s and $b$'s, then the letter $a$, then another arbitrary sequence of $a$'s and $b$'s. Note the arbitrary sequences could be empty. This can generate any word that contains the letter $a$ at least once.

Now, the only words over $\Sigma$ that this expression cannot generate are the words that don't contain the letter $a$ anywhere. These words can be generated with the regular expression $\textbf{b}^{*}$. So, any word over the alphabet can be generated with the regular expression:

<center>
    $\textbf{(a+b)}^{*}\textbf{a}\textbf{(a+b)}^{*} + \textbf{b}^{*}$.
</center>

**Example**: The regular expression $(\textbf{a} + \textbf{b})^{*}$ generates all words in the alphabet $\Sigma = \{a,b\}$.

So, now we see an important distinction between a regular expression and the language it generates. You can have two distinct regular expressions that generate the same language (set of words). For example,

<center>
    $language(\textbf{(a+b)}^{*}\textbf{a}\textbf{(a+b)}^{*} + \textbf{b}^{*}) = language((\textbf{a}+\textbf{b})^{*})$.
</center>

We say that regular expressions that generate the same language are *equivalent*.
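A quick sanity check for a claimed equivalence is to compare the two expressions on every word up to some length. The sketch below uses Python's `re` module (with `|` in place of $+$); the function name `first_difference` and the cutoff `max_len` are assumptions introduced here. Agreement on all short words is evidence, not a proof, but a single disagreement does show that two expressions are *not* equivalent.

```python
import re
from itertools import product

def first_difference(pattern_1, pattern_2, alphabet, max_len):
    """Return the shortest word on which the two patterns disagree, or None."""
    p1, p2 = re.compile(pattern_1), re.compile(pattern_2)
    for n in range(max_len + 1):
        for letters in product(sorted(alphabet), repeat=n):
            word = "".join(letters)
            if bool(p1.fullmatch(word)) != bool(p2.fullmatch(word)):
                return word
    return None

# The two expressions from the text agree on every word of length <= 8.
print(first_difference(r"(a|b)*a(a|b)*|b*", r"(a|b)*", "ab", 8))  # None

# ab* and (ab)* already differ on the empty word.
print(repr(first_difference(r"ab*", r"(ab)*", "ab", 8)))  # ''
```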

## 5. Defining Regular Languages

We'll now formally define the language generated by a regular expression, but first we need the notion of the product of two sets of words.

---

**Definition**

If $S$ and $T$ are two sets of words, then their product $ST$ is the set of all words formed by concatenating a word from $S$ with a word from $T$, in that order: $ST = \{st : s \in S,\ t \in T\}$. So, for example, if $S$ is words in the English language, and $T$ is words in the Spanish language, then their product would include such words as "machinetonto" and "computerbueno".

---
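For finite sets the product can be computed directly. This is a minimal sketch; the function name `product_of_sets` is an assumption introduced here.

```python
def product_of_sets(S, T):
    """Return ST: every word of S followed by every word of T."""
    return {s + t for s in S for t in T}

english = {"machine", "computer"}
spanish = {"tonto", "bueno"}
print(sorted(product_of_sets(english, spanish)))
# ['computerbueno', 'computertonto', 'machinebueno', 'machinetonto']

# Different pairs can produce the same word, so ST may have fewer
# than |S| * |T| elements: here "a" + "bb" and "ab" + "b" coincide.
print(sorted(product_of_sets({"a", "ab"}, {"b", "bb"})))
# ['ab', 'abb', 'abbb']
```

Order matters: $ST$ and $TS$ are generally different sets, since concatenation is not commutative.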

Using this, we can define a language generated by a regular expression.

---

**Definition**

The following rules define the *language associated with a regular expression*:

* **Rule 1**: The language associated with the regular expression that is just a single letter is that one-letter word alone and the language associated with $\boldsymbol{\lambda}$ is just $\{\lambda\}$, a one-word language.
        
* **Rule 2**: If $\textbf{r}_{1}$ is a regular expression associated with the language $L_{1}$ and $\textbf{r}_{2}$ is a regular expression associated with the language $L_{2}$, then:
  * The regular expression $\textbf{r}_{1}\textbf{r}_{2}$ is associated with the product $L_{1}L_{2}$, the set of all words formed by a word of $L_{1}$ followed by a word of $L_{2}$:
  <center>
    $language(\textbf{r}_{1}\textbf{r}_{2}) = L_{1}L_{2}$
  </center>
  * The regular expression $\textbf{r}_{1} + \textbf{r}_{2}$ is associated with the language formed by the union of the sets $L_{1}$ and $L_{2}$:
  <center>
    $language(\textbf{r}_{1} + \textbf{r}_{2}) = L_{1} + L_{2}$
  </center>
  * The language associated with the regular expression $\textbf{r}_{1}^{*}$ is $L_{1}^{*}$, the Kleene closure of the set $L_{1}$ as a set of words:
  <center>
    $language(\textbf{r}_{1}^{*}) = L_{1}^{*}$
  </center>

---
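The rules above translate almost line for line into a recursive function. The sketch below is one possible illustration, under several assumptions made here: expressions are represented as nested tuples with the hypothetical tags `"letter"`, `"lambda"`, `"concat"`, `"plus"`, and `"star"`, and the result is truncated to words of length at most `max_len`, since most of these languages are infinite.

```python
def language(r, max_len):
    """Return the words of length <= max_len in the language of `r`."""
    kind = r[0]
    if kind == "letter":   # Rule 1: a one-letter word alone
        return {r[1]} if max_len >= 1 else set()
    if kind == "lambda":   # Rule 1: the one-word language containing the empty word
        return {""}
    if kind == "concat":   # Rule 2: the product L1 L2
        L1, L2 = language(r[1], max_len), language(r[2], max_len)
        return {u + v for u in L1 for v in L2 if len(u + v) <= max_len}
    if kind == "plus":     # Rule 2: the union of L1 and L2
        return language(r[1], max_len) | language(r[2], max_len)
    if kind == "star":     # Rule 2: the Kleene closure of L1
        L1 = language(r[1], max_len)
        result = {""}
        frontier = {""}
        while frontier:    # keep appending words of L1 until nothing new fits
            frontier = {u + v for u in frontier for v in L1
                        if len(u + v) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown node: {kind!r}")

# (a + c) b*
expr = ("concat",
        ("plus", ("letter", "a"), ("letter", "c")),
        ("star", ("letter", "b")))
print(sorted(language(expr, 3), key=lambda w: (len(w), w)))
# ['a', 'c', 'ab', 'cb', 'abb', 'cbb']
```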

Now that we've got the formal definition, there are some important questions that we can ask right away, including:

* Is there a formal, algorithmic way of telling whether two regular expressions are associated with the same language? That is, can we tell if the two regular expressions are equivalent? It turns out the answer is "yes", and we'll learn how to do this later.

* Can every language be described by a regular expression? It turns out the answer is "no", and we'll see some counterexamples later.


**Example**: All finite languages are regular, meaning each can be defined by a regular expression. If a language is finite, we can just list its words, $w_{1}, w_{2}, \ldots, w_{N}$, and represent it with the regular expression $(\textbf{w}_{1} + \textbf{w}_{2} + \cdots + \textbf{w}_{N})$.
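The same construction works in Python's `re` syntax, where `|` plays the role of $+$. In this sketch the function name `finite_language_pattern` is an assumption introduced here, and the word list is assumed to be non-empty. The call to `re.escape` guards against words containing characters that have a special meaning in Python patterns.

```python
import re

def finite_language_pattern(words):
    """Build a pattern w1|w2|...|wN for a non-empty finite set of words."""
    if not words:
        raise ValueError("this sketch assumes at least one word")
    return "|".join(re.escape(w) for w in sorted(words))

words = {"a", "ab", "bba"}
pattern = finite_language_pattern(words)
print(pattern)  # a|ab|bba

print(all(re.fullmatch(pattern, w) for w in words))  # True
print(re.fullmatch(pattern, "abb"))                  # None
```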

## 6. Practice Exercises

**Exercise 1** - Provide a regular expression in the alphabet $\Sigma = \{a,b\}$ that can generate all strings in which the letter $b$ is *never* tripled. This means no word contains the substring $bbb$.

**Exercise 2** - Show that the regular expressions $(\textbf{a}^{*}\textbf{b})^{*}\textbf{a}^{*}$ and $\textbf{a}^{*}(\textbf{b}\textbf{a}^{*})^{*}$ generate the same language.
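When working on Exercise 1, it can help to test a candidate answer against the plain-language description before trying to prove anything. The following sketch is a hypothetical test harness, not a solution: the names `first_disagreement` and `no_triple_b` are assumptions introduced here, candidates are written in Python's `re` syntax (with `|` for $+$), and the search stops at a maximum word length, so passing it does not prove a candidate correct.

```python
import re
from itertools import product

def first_disagreement(pattern, predicate, alphabet, max_len):
    """Return the shortest word where `pattern` and `predicate` disagree, or None."""
    compiled = re.compile(pattern)
    for n in range(max_len + 1):
        for letters in product(sorted(alphabet), repeat=n):
            word = "".join(letters)
            if bool(compiled.fullmatch(word)) != bool(predicate(word)):
                return word
    return None

def no_triple_b(word):
    """The property from Exercise 1: the substring bbb never appears."""
    return "bbb" not in word

# A deliberately wrong candidate: b* cannot generate the word 'a'.
print(repr(first_disagreement(r"b*", no_triple_b, "ab", 6)))  # 'a'
```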

## 7. Further Reading

* "Introduction to Computer Theory" by Daniel I.A. Cohen *Chapter 2 - Languages, Chapter 3 - Recursive Definitions*
* "Introduction to Automata Theory, Languages, and Computation" by Hopcroft, Motwani, and Ullman *Section 1.5 - The Central Concepts of Automata Theory*