<div style="text-align: right">
    <i>
        LIN 537: Computational Linguistics 1 <br>
        Fall 2019 <br>
        Alëna Aksënova
    </i>
</div>

# Notebook 10: regular expressions

This notebook introduces the notion of *formal languages* and defines one of the ways formal languages can be classified (based on their expressive power).
It then introduces a hierarchy of formal languages, and goes in depth about the class of _regular languages._
They can be written as **regular expressions**, and this notebook mostly focuses on working with regular languages using Python's `re` package.

## Formal languages

Formally, (string) _languages_ are collections of strings produced by a certain grammar.
A grammar is a way to _generalize_ the pattern behind the language.
For example,

    Language 1: ab, abab, ababab, abababab... (*aba, *babababab, *a)
     Grammar 1: repeat "ab" an arbitrary number of times.
    
    Language 2: aaab, baaa, aaabaaaa, b, aba, aaaaabaaaaaaaaa, baaaaaaaa, aab... (*aaaabaaaabaaa, *aaaaa)
     Grammar 2: have a single "b" in the string, and then any number of "a" on either side of "b".
    
    Language 3: abc, aabbcc, aaabbbccc, aaaabbbbcccc... (*abbccc, *aaabbbc)
     Grammar 3: have "b" after "a" and "c" after "b", and repeat every letter n times.
    
    Language 4: aabb, aaabbb, aaaaabbbbb, aaaaaaabbbbbbb, aaaaaaaaaaabbbbbbbbbbb... (*ab, *aaaabbbb)
     Grammar 4: n times "a" and n times "b", where n is a prime number.
    
These grammars are very descriptive, and we want to have a way to _formalize_ them, thereby obtaining a **formal grammar** for each language.


### Expressivity

It is intuitive that some rules generating languages use simpler operations than others.
For example, Grammar 1 simply uses the repetitions of the substring "ab", whereas Grammar 3 _balances_ the number of letters "a", "b" and "c" while preserving their order.
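
To make this contrast concrete, here is a minimal sketch of membership checks for Language 1 and Language 3. The function names `in_language_1` and `in_language_3` are hypothetical, and this is only one possible way to write such checks. Notice that the first check only compares the string with repetitions of "ab", while the second one has to make sure that three blocks have the same length.

```python
def in_language_1(s):
    """Check that s is "ab" repeated one or more times."""
    # Language 1 as listed starts from "ab", so the empty string is excluded.
    return len(s) > 0 and s == "ab" * (len(s) // 2)


def in_language_3(s):
    """Check that s is n "a"s, then n "b"s, then n "c"s, for n >= 1."""
    n = len(s) // 3
    # All three blocks must have the same length n.
    return n > 0 and s == "a" * n + "b" * n + "c" * n


print([in_language_1(s) for s in ["ab", "abab", "aba", "a"]])
print([in_language_3(s) for s in ["abc", "aabbcc", "abbccc", "aaabbbc"]])
```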

The nested hierarchy of formal languages aligned with respect to their complexity is called **the Chomsky hierarchy** [(Chomsky 1959)](http://www.cs.utexas.edu/~cannata/pl/Class%20Notes/Chomsky_1959%20On%20Certain%20Formal%20Properties%20of%20Grammars.pdf).

<img src="images/10_1.png" width="600">

**Finite** languages are not infinite, and can be defined by simply listing all the strings of the language.

In [None]:
finite_language = ["abc", "cba", "bac", "cab", "acb", "bca"]

All classes apart from finite languages describe potentially infinite stringsets. Later in this course, we will learn some tools for working with two classes of the Chomsky hierarchy: **regular** and **context-free**.

**Question:** why "potentially infinite", and not just "infinite"?

## Regular expressions

Regular languages can be described via so-called **regular expressions** (**regex**).
Regexes are strings in a special representation describing languages, and this representation was invented by **Stephen Kleene** in the 1950s. Interestingly, he came up with the idea of regular expressions when trying to describe the behavior of _McCulloch-Pitts_ neural networks, the first model inspired by human neurons! See more [here](https://towardsdatascience.com/mcculloch-pitts-model-5fdf65ac5dd1).

<img src="images/10_2.png" width="200">
<center>
    <i> Stephen Kleene </i>
</center>


In order to work with regexes through the Python interface, let's import the `re` package.

In [None]:
import re

## Kleene star

The basic concept introducing infinity in regular expressions is the **Kleene star**.
It is denoted as `*`, and it simply means "repeat the preceding symbol or string an arbitrary number of times".
Here and further in the notebook, I'll be using `""` to denote an empty string.

    RegEx:  a*
    Language: "", a, aa, aaa, aaaaa, ..., aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa, ...
    
### Function `re.search`
    
The function `re.search` searches for the first substring that corresponds to a given regular expression within the given string. It returns a _match object_ if a matching substring was detected, and `None` otherwise.

    re.search(r"regex", r"string")
    
Notice the `r` before the strings: it means "raw string". In a raw string, Python leaves backslashes untouched instead of interpreting them as escape sequences. Since many special regex markers are written with a backslash, it is easier to always work with raw strings and avoid any confusion.
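
As a small illustration of why the `r` prefix matters, the sketch below compares an ordinary and a raw string containing `\b`. The word boundary marker `\b` is covered later in this notebook; here it only serves as an example of a backslash sequence that Python and regular expressions read differently.

```python
import re

# In an ordinary string, "\b" is a single backspace character;
# in a raw string, r"\b" is a backslash followed by the letter "b".
print(len("\b"), len(r"\b"))

# Only the raw version reaches the regex engine as the word boundary marker.
print(re.search("\bin\b", "x in x"))
print(re.search(r"\bin\b", "x in x"))
```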

In [None]:
print(re.search(r"a*", r"aaaaa"))

Indeed, a string "aaaaa" can be matched by the regular expression $a^{*}$.

In [None]:
print(re.search(r"a*", r"aaaabaa"))

The string "aaaabaa" contains $2$ maximal runs of "a" that can be matched by the regular expression $a^{*}$: "**aaaa**baa" and "aaaab**aa**". However, `re.search` only returns the first match, therefore the span that was detected is `span=(0, 4)`.

The values of `span` can be extracted in the following manner:

In [None]:
matched_object = re.search(r"a*", r"aaaabaa")
span = matched_object.span()
print(span)

To extract the matched substring, we need to apply `.group(0)` to the matched object.

In [None]:
match = matched_object.group(0)
print(match)

As a sidenote, Python allows for "unpacking" of the values and multiple variable definitions on a single line.

In [None]:
start, end = span[0], span[1]
print("Start:", start)
print("End:  ", end)
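
The same information can be obtained a bit more directly. The following sketch (with the hypothetical variable names `text` and `m`) unpacks the tuple returned by `.span()` in one step, and shows that slicing the searched string with these values gives the same result as `.group(0)`.

```python
import re

text = "aaaabaa"
m = re.search(r"a*", text)

# span() returns a (start, end) tuple, so it can be unpacked directly.
start, end = m.span()

# The same values are available through start() and end().
print(start == m.start(), end == m.end())

# Slicing the searched string with the span gives the matched substring.
print(text[start:end] == m.group(0))
```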

**Practice:** write a function `full_match` detecting if the whole given string can be matched by the given regular expression.

    Function call:  full_match("a*", "aaaaa")
    Output:         True
    
    Function call:  full_match("a*", "aaaaabaa")
    Output:         False
    
    Function call:  full_match("a*", "bbbbb")
    Output:         False

Test it in the following cell.

### Function `re.findall`

If all matching substrings are required, we can use the `re.findall` function, which returns _all_ non-overlapping matching substrings.

    re.findall(r"regex", r"string")

In [None]:
print(re.findall(r"ab", r"aaabbbababa"))

**Question:** explain the output of the following function call.

In [None]:
print(re.findall(r"a*", r"aaaabaa"))

So far we saw the application of the Kleene star to only a single symbol. In order to match repeating substrings, we need to put that substring in parentheses. For example, see the contrast:

In [None]:
print(re.search(r"(ab)*", r"abababbbb"))

In [None]:
print(re.search(r"ab*", r"abababbbb"))

In the first case, the first match is `ababab`, i.e. it is a string "ab" that is repeating.
In the second case, however, the first match is simply `ab`, because the regular expression $ab^{*}$ is matching any number of "b" preceded by an "a".
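
One caveat is worth keeping in mind when parentheses are combined with `re.findall`: if the pattern contains a parenthesized group, `re.findall` returns what the group captured rather than the whole match. The sketch below illustrates this. The `(?:...)` notation, which groups symbols without capturing them, is not otherwise covered in this notebook and is shown here only as one possible workaround.

```python
import re

# With a parenthesized group, findall reports what the group captured last.
print(re.findall(r"(ab)+", "ababxab"))

# (?:...) groups without capturing, so findall reports the whole match.
print(re.findall(r"(?:ab)+", "ababxab"))
```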

## Kleene plus

Kleene star matches _any_ number of the given substring, from $0$ to any $n$.
**Kleene plus** matches the given substring $1$ or more times.

    Grammar:  (ab)*
    Language: "", ab, abab, ababab, ababababab...
    
    Grammar:  (ab)+
    Language: ab, abab, ababab, ababababab...

In [None]:
print(re.search(r"(ab)*", r""))

In [None]:
print(re.search(r"(ab)+", r""))

**Practice 1:** simplify the following regular expression: `ab(ab)*`.

**Practice 2:** for the following regular expressions, tell if the given strings are fully matched by them.

    Regular expression:  a*b*a*
    Strings:             "", aaaaa, bbbb, aaabbbbb, aabaaa, aaabbbabbb, bbbaa, bab
    
    Regular expression:  a*b+a*
    Strings:             "", aaaaa, bbbb, aaabbbbb, aabaaa, aaabbbabbb, bbbaa, bab
    
    Regular expression:  (a*b*)*
    Strings:             "", aabb, aaabbb, aabbaaaabbbb, bbbb, baaaaa, bababbabb
    
    Regular expression:  (a+b*)*
    Strings:             "", aabb, aaabbb, aabbaaaabbbb, bbbb, baaaaa, bababbabb
    
    Regular expression:  (a+c*b+)+
    Strings:             "", ab, abab, aaaacc, bbba, aaabbbabbbb, aabbbbbcbbb
    
    Regular expression:  (a*c*b+)+
    Strings:             "", ab, abab, aaaacc, bbba, aaabbbabbbb, aabbbbbcbbb
    
**Practice 3:** come up with a regex describing the following two languages. Test it in the following cells.

    Language 1:  "", abbc, bbbcc, ccc, acc, abbccc...
    Language 2:  abccc, a, abcbcccc, abbb, abbbccc...

In [None]:
lang_1 = ["", "abbc", "bbbcc", "ccc", "acc", "abbccc"]
for s in lang_1:
    print(re.search(r"REGEX", s))

In [None]:
lang_2 = ["abccc", "a", "abcbcccc", "abbb", "abbbccc", "abcbcbc"]
for s in lang_2:
    print(re.search(r"REGEX", s))

### Matching one of the list of characters

In order to match one of some list of characters, we can enclose those characters in square brackets:

    r"...[abc]..."
    
The regular expression above will match "a", "b" or "c".

In [None]:
print(re.search(r"m[ae]n", r"woman"))
print(re.search(r"m[ae]n", r"women"))

If you need to match a set of characters containing some range of digits or alphabet symbols, these ranges can be written in a shortcut way: `[1-7]`, `[a-z]` or `[A-D]`.

In [None]:
print(re.findall(r"[1-5]", r"1945"))

Of course, Kleene star or plus can be applied to any set of symbols:

In [None]:
print(re.findall(r"[0-9]+", r"It repeated 518305 times!"))

Note that these lists are case-sensitive:

In [None]:
print(re.findall(r"[a-z]", r"ABBA"))
print(re.findall(r"[A-Z]", r"ABBA"))

If we want to perform matches without case-sensitivity, we can add one more argument to the function `re.findall`. This argument is `re.I`, a short form for `re.IGNORECASE`.

In [None]:
print(re.findall(r"[a-z]", r"ABBA", re.I))
print(re.findall(r"[A-Z]", r"ABBA"))

### Matching the beginning/end of string

If we want to make sure that the regular expression is not only contained within the string we are searching, but rather describes the whole string, we can surround the regex with the following two symbols:

  * `^` marks the beginning of the string;
  * `$` marks the end of the string. 
  
Of course, these two symbols can be used separately as well in order to match the beginning or the end of the string.

In [None]:
print(re.search(r"m[ae]n", r"woman"))
print(re.search(r"^m[ae]n$", r"woman"))
print(re.search(r"m[ae]n$", r"woman"))

### Any character `.`

A dot `.` stands for _any character_ (by default, any character except a newline).

In [None]:
print(re.search(r"m.n", r"man"))
print(re.search(r"m.n", r"men"))

However, it will match just a single character:

In [None]:
print(re.search(r"m.n", r"moon"))

Consequently, applying Kleene star to a `.` will yield the original string, as long as the string contains no newline characters.

In [None]:
print(re.search(r".*", r"moon"))
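
By default, the dot does not match a newline character, which matters for strings that span several lines. The sketch below shows this behavior on a hypothetical two-line string `two_lines`, together with the `re.DOTALL` flag that lets `.` match newlines as well.

```python
import re

two_lines = "moon\nsun"

# By default, "." stops at the newline, so only the first line is matched.
print(repr(re.search(r".*", two_lines).group(0)))

# With re.DOTALL (short form re.S), "." also matches the newline.
print(repr(re.search(r".*", two_lines, re.DOTALL).group(0)))
```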

### Matching certain number of times

To match some symbol $n$ times, we can add curly braces `{}` after that symbol, and put that $n$ inside: `s{n}`.

In [None]:
print(re.search(r"mo{2}n", r"moon"))

We can also provide a range for the number of times we want to match that symbol. The range is denoted as `s{m,n}`, where `m` and `n` are the smallest and the largest allowed number of repetitions (both included).

In [None]:
print(re.search(r"mo{2,5}n", r"moooon"))
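
To see that both ends of the range are included, the following sketch tries the same regular expression on words with $1$ to $6$ letters "o". The names `results`, `word` and `matched` are only used for this illustration.

```python
import re

# Build "mon", "moon", ..., "moooooon" and test each one against o{2,5}.
results = []
for k in range(1, 7):
    word = "m" + "o" * k + "n"
    matched = re.search(r"mo{2,5}n", word) is not None
    results.append(matched)
    print(k, word, matched)
```

Only the words with $2$, $3$, $4$ or $5$ letters "o" are matched.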

If we want to check a repetition of a group of symbols, we can enclose that group in round parentheses `()`:

In [None]:
print(re.search(r"fa(la){2,5}fel", r"falalalafel"))

### Matching groups of symbols

However, if we want to match one of the listed groups of symbols, it will not work to simply put those groups in parentheses inside square brackets:

In [None]:
print(re.findall(r"[(la)(lo)]", r"lola"))

The symbols in the square brackets are interpreted literally, and therefore the expression above will match parentheses too:

In [None]:
print(re.findall(r"[(la)(lo)]", r"hello (world)"))

To match one of the listed groups of symbols, we can join those groups using a special symbol `|` meaning **or**.

In [None]:
print(re.search(r"la|lo", r"lola"))
print(re.search(r"(la|lo){2}", r"lola"))
print(re.search(r"ba(la|lo){1,2}on", r"baloon"))
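
The parentheses around the alternatives matter, because `|` separates everything to its left from everything to its right. The sketch below contrasts the two readings on the same string.

```python
import re

# Without parentheses, "|" splits the whole pattern: "bala" OR "loon".
print(re.search(r"bala|loon", "baloon"))

# With parentheses, only the middle part alternates: "ba" + ("la" or "lo") + "on".
print(re.search(r"ba(la|lo)on", "baloon"))
```

In the first case only the substring "loon" is matched, while in the second case the whole string is matched.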

### Optionality

Optionality of a symbol or a group of symbols can be expressed via putting a question mark `?` after that symbol or a group.

In [None]:
print(re.search(r"python(ic)?", r"python"))
print(re.search(r"python(ic)?", r"pythonic"))
print(re.search(r"(py)?th?on(ic)?", r"pythonic"))
print(re.search(r"(py)?th?on(ic)?", r"tonic"))

### Negation

If symbols from some set should **not** be present in-between two characters, we can use a negation marker `^` placed right after the opening square bracket: `[^...]`.

In [None]:
print(re.search(r"m[^bc]d", r"mad"))
print(re.search(r"m[^bc]d", r"mcd"))
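
Note that a negated set still stands for exactly one character. The following sketch shows that `[^bc]` accepts any single character other than "b" and "c", but it does not accept the absence of a character.

```python
import re

# Any single character other than "b" or "c" is accepted, not only letters.
print(re.search(r"m[^bc]d", "m d"))
print(re.search(r"m[^bc]d", "m7d"))

# But some character has to be there: "md" has nothing between "m" and "d".
print(re.search(r"m[^bc]d", "md"))
```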

**Practice.** Find a regular expression that will prohibit any sequence made out of letters "b" and "c" in-between "m" and "d".

    "md"        ->   Match
    "mod"       ->   Match
    "mcd"       ->   No match
    "mbcd"      ->   No match
    "mbbcbcd"   ->   No match
    "mcccccc"   ->   No match

## Lookaround assertions

Sometimes we want to match a string only in a particular position. In this case, we can use different types of lookaheads and lookbehinds:

  * `(?<=foo)` **lookbehind**: makes sure that a string "foo" precedes the pattern we are matching;
  * `(?=foo)`  **lookahead**:  makes sure that a string "foo" follows the pattern we are matching;
  * `(?<!foo)` **negative lookbehind**: makes sure that a string "foo" does not precede the pattern we are matching;
  * `(?!foo)`  **negative lookahead**:  makes sure that a string "foo" does not follow the pattern we are matching.
  
Lookbehinds check that the match is preceded by a certain pattern. In Python's `re` package, the pattern inside a lookbehind must have a fixed length.

In [None]:
print(re.search(r"(?<=hello )world", r"hello world!"))
print(re.search(r"(?<!hello )world", r"goodbye world!"))
print(re.search(r"(?<!hello )world", r"hello world!"))

Similarly, lookaheads check that the match is followed by a certain pattern.

In [None]:
print(re.search(r"hello(?= world)", r"hello world!"))
print(re.search(r"hello(?! world)", r"hello people!"))
print(re.search(r"hello(?! world)", r"hello world!"))
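
An important property of lookarounds is that the text they check does not become a part of the match. The sketch below compares the spans of two matches (stored in the hypothetical variables `with_lookahead` and `without_lookahead`) to illustrate this.

```python
import re

text = "hello world!"

# The lookahead checks for " world" but does not include it in the match.
with_lookahead = re.search(r"hello(?= world)", text)
print(with_lookahead.span(), with_lookahead.group(0))

# Without the lookahead, " world" becomes part of the match.
without_lookahead = re.search(r"hello world", text)
print(without_lookahead.span(), without_lookahead.group(0))
```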

**Practice.** You are given the following string.

In [None]:
string = "a big giraffe came out of the water"

Match all words consisting of $2$ or $3$ symbols in `string`. You should expect to get the following output:

    ['big', 'out', 'of', 'the']

As a side note, it is good style to use as few lookaround expressions as possible. In simple words, when processing a lookahead or a lookbehind, the regex engine "pretends" that it is another regular expression that needs to be matched, attempts the match, and then needs to go back to the position where the assumption was made (this is why they are called _assertion operations_). This backtracking operation is pretty expensive, and given large lookaround expressions and/or long strings to be searched, it can slow down the code by a lot.

### Shortcuts

Additionally, there is a list of shortcuts to the collections of symbols of some type. Here are some of them:

  * `\d` matches digits, equivalent to `[0-9]` on ASCII text;
  * `\D` matches non-digits, equivalent to `[^0-9]`;
  * `\s` matches any whitespace character, equivalent to `[ \t\n\r\f\v]`;
  * `\S` matches any non-whitespace character, equivalent to `[^ \t\n\r\f\v]`;
  * `\w` matches any alphanumeric character or underscore, equivalent to `[a-zA-Z0-9_]`;
  * `\W` matches any character that is not alphanumeric or underscore, equivalent to `[^a-zA-Z0-9_]`.
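
As a quick check of how these shortcuts behave on plain ASCII text, consider the following sketch. The test strings are made up for this illustration.

```python
import re

# \d includes the digit 0.
print(re.findall(r"\d", "a0b19"))

# \s and \S are complements: \S+ collects the runs between whitespace.
print(re.findall(r"\S+", "a b\tc\nd"))

# \w and \W are complements: \W picks up punctuation and spaces.
print(re.findall(r"\W", "hi, there!"))
```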

In [None]:
sentence = "this sentence, for example, contains punctuations and 1 digit. regular expressions can help us to tokenize this sentense!"
print(re.findall(r"\w+", sentence))

Additionally, the marker `\b` indicates word boundaries.

In [None]:
words = ["visiting", "inverse", "within", "in", "x visiting x", "x inverse x", "x within x", "x in x"]

for w in words:
    print(w, "  -->  ", re.search(r"\bin\b", w))

Everything listed above is just a small part of the functionality of regular expressions. It is useful to remember the basic principles; however, if something more advanced is required, you can always refer to cheat sheets or tutorials, such as [this one](https://www.rexegg.com/regex-quickstart.html).

# Homework 10

**Due on Sunday, November 10th, 11.59pm**

Send your notebook (don't forget to save your solutions!) to <alena.aksenova@stonybrook.edu> with the subject **\[CompLing1\] Homework 10**.

**Problem.** You are given the following string.

In [None]:
string = "The first number is Mary 528973, the other one is Peter 29857, and the last one is Mira 245289."

First, extract a list of numbers from this `string`.

    Expected output: ['528973', '29857', '245289']

Second, extract a list of names from the `string`. Notice that every name is followed by a number.

    Expected output: ['Mary', 'Peter', 'Mira']

Finally, merge these two lists together into a dictionary.

    Expected output: {'Mary': '528973', 'Peter': '29857', 'Mira': '245289'}