<div style="text-align: right">
    <i>
        LIN 537: Computational Linguistics 1 <br>
        Fall 2019 <br>
        Alëna Aksënova
    </i>
</div>

# Notebook 10: regular expressions

This notebook introduces the notion of *formal languages* and defines one of the ways formal languages can be classified (based on their expressive power).
It then introduces a hierarchy of formal languages and looks in depth at the class of _regular languages._
Regular languages can be written as **regular expressions**, and this notebook mostly focuses on working with them using Python's `re` package.

## Formal languages

Formally, (string) _languages_ are collections of strings produced by a certain grammar.
A grammar is a way to _generalize_ the pattern behind the language.
For example,

    Language 1: ab, abab, ababab, abababab... (*aba, *babababab, *a)
    Grammar 1:  repeat "ab" an arbitrary number of times
    
    Language 2: aaab, baaa, aaabaaaa, b, aba, aaaaabaaaaaaaaa, baaaaaaaa, aab... (*aaaabaaaabaaa, *aaaaa)
    Grammar 2:  have a single "b" in the string, and then any number of "a" on either side of "b"
    
    Language 3: abc, aabbcc, aaabbbccc, aaaabbbbcccc... (*abbccc, *aaabbbc)
    Grammar 3:  have "b" after "a" and "c" after "b", and repeat every letter n times.
    
    Language 4: aabb, aaabbb, aaaaabbbbb, aaaaaaabbbbbbb, aaaaaaaaaaabbbbbbbbbbb... (*ab, *aaaabbbb)
    Grammar 4:  <guess>
    
These grammars are merely verbal descriptions, and we want to have a way to _formalize_ them, thereby obtaining a **formal grammar** for each language.


### Expressivity

It is intuitive that some rules generating languages use simpler operations than others.
For example, Grammar 1 simply repeats the substring "ab", whereas Grammar 3 _balances_ the number of letters "a", "b" and "c" while preserving their order.
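To make the contrast concrete, here is a minimal sketch of two membership tests written in plain Python. The function names `in_language_1` and `in_language_3` are made up for this illustration, and the sketch assumes that the empty string belongs to neither language, as the listings above suggest.

```python
def in_language_1(string):
    """Check whether the string is "ab" repeated one or more times."""
    repetitions = len(string) // 2
    return repetitions > 0 and string == "ab" * repetitions


def in_language_3(string):
    """Check whether the string is n "a"s, then n "b"s, then n "c"s (n >= 1)."""
    n = len(string) // 3
    return n > 0 and string == "a" * n + "b" * n + "c" * n


for s in ["abab", "aba", "aabbcc", "abbccc"]:
    print(s, in_language_1(s), in_language_3(s))
```

The first check only needs to recognize a repeated block, while the second one has to compare counts, which is what makes Language 3 more complex than Language 1.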

The nested hierarchy of formal language classes, ordered by their complexity, is called **the Chomsky hierarchy** [(Chomsky 1959)](http://www.cs.utexas.edu/~cannata/pl/Class%20Notes/Chomsky_1959%20On%20Certain%20Formal%20Properties%20of%20Grammars.pdf).

<img src="images/10_1.png" width="600">

**Finite** languages contain only finitely many strings, so they can be defined by simply listing all the strings of the language.

In [None]:
finite_language = ["abc", "cba", "bac", "cab", "acb", "bca"]
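Since a finite language is just a list of strings, checking membership needs no grammar at all. A small sketch (the helper name `in_finite_language` is hypothetical):

```python
finite_language = ["abc", "cba", "bac", "cab", "acb", "bca"]


def in_finite_language(string, language):
    """A string belongs to a finite language iff it is one of the listed strings."""
    return string in language


print(in_finite_language("bac", finite_language))
print(in_finite_language("abcabc", finite_language))
```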

All classes apart from the finite languages describe potentially infinite stringsets. Later in this course, we will learn some tools for working with two classes of the Chomsky hierarchy: **regular** and **context-free**.

**Question:** why "potentially infinite", and not just "infinite"?

## Regular expressions

Regular languages can be described via so-called **regular expressions** (**regex**).
Regexes are strings written in a special notation for describing languages; this notation was invented by **Stephen Kleene** in the 1950s. Interestingly, he came up with the idea of regular expressions when trying to describe the behavior of _McCulloch-Pitts_ neural networks, the first model inspired by human neurons! See more [here](https://towardsdatascience.com/mcculloch-pitts-model-5fdf65ac5dd1).

<img src="images/10_2.png" width="200">
<center>
    <i> Stephen Kleene </i>
</center>


In order to work with regexes through the Python interface, let's import the `re` package.

In [None]:
import re

## Kleene star

The basic concept that introduces infinity into regular expressions is the **Kleene star**.
It is denoted as `*`, and it simply means "repeat the preceding symbol or string an arbitrary number of times, including zero".
Here and further in the notebook, I'll be using `""` to denote the empty string.

    RegEx:  a*
    Language: "", a, aa, aaa, aaaaa, ..., aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa, ...
    
### Function `re.search`
    
The function `re.search` searches the given string for the first substring that matches the given regular expression. It returns a _match object_ if a matching substring is found, and `None` otherwise.

    re.search(r"regex", r"string")
    
Notice the `r` before the strings: it marks a _raw string_, in which Python leaves backslashes untouched instead of interpreting them as escape sequences such as `\n`. Regular expressions give a special meaning to many backslash sequences of their own, so to avoid any confusion, it is easier to work with raw strings.
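The difference between an ordinary and a raw string literal can be checked directly. This small sketch only relies on built-in Python behavior, and the variable names are arbitrary.

```python
plain = "\n"   # one character: a newline
raw = r"\n"    # two characters: a backslash followed by "n"

print(len(plain), len(raw))
print(raw == "\\n")
```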

In [None]:
print(re.search(r"a*", r"aaaaa"))

Indeed, the string "aaaaa" can be matched by the regular expression $a^{*}$.

In [None]:
print(re.search(r"a*", r"aaaabaa"))

The string "aaaabaa" contains two runs of "a" that can be matched by the regular expression $a^{*}$: "**aaaa**baa" and "aaaab**aa**". However, `re.search` returns only the first match it finds, scanning from left to right, so the detected span is `span=(0, 4)`.
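One consequence of "zero or more" is easy to overlook: $a^{*}$ also matches the empty string, so `re.search` with this pattern succeeds even when the string does not start with "a". The sketch below (with illustrative variable names) contrasts such an empty match with a search that really fails.

```python
import re

empty_match = re.search(r"a*", r"baaa")   # matches "" at position 0
no_match = re.search(r"b", r"aaaa")       # there is no "b" to find

print(empty_match.span())
print(repr(empty_match.group(0)))
print(no_match)
```

This is why a result should be compared against `None` rather than assumed to be a non-empty match.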

The values of `span` can be extracted in the following manner:

In [None]:
matched_object = re.search(r"a*", r"aaaabaa")
span = matched_object.span()
print(span)

To extract the matched substring, we need to apply `.group(0)` to the match object.

In [None]:
match = matched_object.group(0)
print(match)

As a side note, Python allows for "unpacking" of the values and multiple variable definitions on a single line.

In [None]:
start, end = span[0], span[1]
print("Start:", start)
print("End:  ", end)
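The span and the matched substring are two views of the same information: slicing the original string with the span gives back `.group(0)`. A brief sketch, which also uses the match object methods `.start()` and `.end()` and unpacks the span tuple directly:

```python
import re

text = r"aaaabaa"
matched_object = re.search(r"a*", text)

start, end = matched_object.span()        # unpack the tuple in one step
print(text[start:end] == matched_object.group(0))
print(matched_object.start(), matched_object.end())
```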

**Practice:** write a function `full_match` that detects whether the whole given string can be matched by the given regular expression.

    Function call:  full_match("a*", "aaaaa")
    Output:         True
    
    Function call:  full_match("a*", "aaaaabaa")
    Output:         False
    
    Function call:  full_match("a*", "bbbbb")
    Output:         False

Test it in the following cell.

### Function `re.findall`

If all matching substrings are required, we can use the `re.findall` function, which returns _all_ non-overlapping matching substrings.

    re.findall(r"regex", r"string")

In [None]:
print(re.findall(r"ab", r"aaabbbababa"))
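`re.findall` returns only the matched strings. If the positions are needed as well, the `re` package also provides `re.finditer`, which yields one match object per match. A minimal sketch on the same string:

```python
import re

text = r"aaabbbababa"
spans = [m.span() for m in re.finditer(r"ab", text)]
print(spans)  # [(2, 4), (6, 8), (8, 10)]
```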

**Question:** explain the output of the following function call.

In [None]:
print(re.findall(r"a*", r"aaaabaa"))

So far we have seen the Kleene star applied only to a single symbol. In order to match a repeating substring, we need to put that substring in parentheses. For example, see the contrast:

In [None]:
print(re.search(r"(ab)*", r"abababbbb"))

In [None]:
print(re.search(r"ab*", r"abababbbb"))

In the first case, the first match is `ababab`: the whole string "ab" is repeated.
In the second case, however, the first match is simply `ab`, because the star applies only to "b": the regular expression $ab^{*}$ matches a single "a" followed by any number of "b".
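A word of caution when combining parentheses with `re.findall`: parentheses also create a _capturing group_, and when a pattern contains one, `re.findall` returns the content of the group rather than the whole match. The sketch below shows the difference using the non-capturing form `(?:...)`, which is standard `re` syntax not otherwise covered here; only the first result of each call is printed.

```python
import re

text = r"abababbbb"

with_group = re.findall(r"(ab)*", text)       # reports what the group captured last
without_group = re.findall(r"(?:ab)*", text)  # reports the whole match

print(with_group[0])
print(without_group[0])
```

`re.search` is not affected by this, since `.group(0)` always refers to the whole match.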

## Kleene plus

The Kleene star matches _any_ number of repetitions of the given substring, from $0$ to any $n$.
The **Kleene plus** matches the given substring $1$ or more times.

    Grammar:  (ab)*
    Language: "", ab, abab, ababab, ababababab...
    
    Grammar:  (ab)+
    Language: ab, abab, ababab, ababababab...

In [None]:
print(re.search(r"(ab)*", r""))

In [None]:
print(re.search(r"(ab)+", r""))
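Kleene plus can always be rewritten with Kleene star: "one or more" is the same as "one, followed by zero or more". The sketch below checks this on a few strings using `re.fullmatch`, a function of the `re` package that succeeds only if the whole string matches; the test strings are chosen arbitrarily.

```python
import re

results = []
for s in ["", "ab", "abab", "aba"]:
    plus = re.fullmatch(r"(ab)+", s) is not None
    star = re.fullmatch(r"ab(ab)*", s) is not None
    results.append((s, plus, star))
    print(repr(s), plus, star)
```

On every string the two patterns should agree, and only the empty string separates `(ab)+` from `(ab)*`.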

**Practice 1:** simplify the following regular expression: `ab(ab)+`.

**Practice 2:** for each of the following regular expressions, determine which of the given strings are fully matched by it.

    Regular expression:  a*b*a*
    Strings:             "", aaaaa, bbbb, aaabbbbb, aabaaa, aaabbbabbb, bbbaa, bab
    
    Regular expression:  a*b+a*
    Strings:             "", aaaaa, bbbb, aaabbbbb, aabaaa, aaabbbabbb, bbbaa, bab
    
    Regular expression:  (a*b*)*
    Strings:             "", aabb, aaabbb, aabbaaaabbbb, bbbb, baaaaa, bababbabb
    
    Regular expression:  (a+b*)*
    Strings:             "", aabb, aaabbb, aabbaaaabbbb, bbbb, baaaaa, bababbabb
    
    Regular expression:  (a+c*b+)+
    Strings:             "", ab, abab, aaaacc, bbba, aaabbbabbbb, aabbbbbcbbb
    
    Regular expression:  (a*c*b+)+
    Strings:             "", ab, abab, aaaacc, bbba, aaabbbabbbb, aabbbbbcbbb
    
**Practice 3:** come up with a regex describing each of the following two languages. Test them in the following cells.

    Language 1:  "", abbc, bbbcc, ccc, acc, abbccc...
    Language 2:  abccc, a, abcbcccc, abbb, abbbccc...

In [None]:
lang_1 = ["", "abbc", "bbbcc", "ccc", "acc", "abbccc"]
for s in lang_1:
    print(re.search(r"REGEX", s))

In [None]:
lang_2 = ["abccc", "a", "abcbcccc", "abbb", "abbbccc", "abcbcbc"]
for s in lang_2:
    print(re.search(r"REGEX", s))

### Optionality

### Counting repetitions

### Matching in the beginning or end of the string

### Sets of characters

### Negation

### Any character `.`

### Special sequences