
# Regular Expressions


In Computer Science, a (formal) **language** is defined as *a set of strings*. 

`L = {"b", "ab", "aab", "aaab"}` is a language defined by enumeration.


`L = {strings of length 1...4, starting with zero or more 'a's and ending in exactly 1 'b'}` is the same language defined by a rule.
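
As a small illustration, the two definitions above can be checked against each other in Python. This is only a sketch: the pattern `a{0,3}b` and the helper name `in_language` are assumptions made here for illustration, and `re.fullmatch` is used so that the whole string must fit the rule.

```python
import re
from itertools import product

# The language defined by enumeration.
L_enum = {"b", "ab", "aab", "aaab"}

def in_language(s):
    """Rule-based membership test: up to three 'a's, then exactly one 'b'."""
    return re.fullmatch(r"a{0,3}b", s) is not None

# Build every string over the alphabet {a, b} of length 1 to 4
# and keep only those accepted by the rule.
candidates = ("".join(chars)
              for n in range(1, 5)
              for chars in product("ab", repeat=n))
L_rule = {s for s in candidates if in_language(s)}

print(L_rule == L_enum)  # True: both definitions give the same set
```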


`L = {valid email addresses according to RFC2822}` is a language. https://tools.ietf.org/html/rfc2822


`L = {valid Python files}` is a language too, defined partly by the Python documentation and partly, perhaps implicitly, by the behaviour of the Python interpreter.

There are two main things we might want to do with languages.
1. *Recognize* strings, i.e. check whether a given string is in the language:
   * just check, i.e. return True/False;
   * check, and (assuming True) extract some of the structure or contents of the string;
2. *Generate* strings, i.e. generate one or more random strings from the language.

There are several different techniques suitable for *recognising* different types of languages:

* manual testing against explicit rules;
* regular expressions;
* grammars;
* finite state machines.

### Recognising and extracting structure: some problems

* Validate user IDs, credit card numbers, post-codes, etc.
* Extract all email addresses from a text document
* Extract all the html tags from a html document
* Extract all the docstrings from Python code
* Check whether a URL is blacklisted
* Syntax-highlighting code
* Detecting repeated words in text, e.g. common typo "the the", "an an"
* Advanced find and replace mode in text editors/IDEs.

### Regular expressions


In the 1960s and 1970s programmers realised that they were solving
problems like this over and over, so they started to use regular
expressions - a powerful, general method for matching strings against patterns, i.e. recognising (a certain type of) formal language.


Replace: manually-written, error-prone, wheel-reinventing code

$\rightarrow$ a single formalism, a purpose-built mini-language, a single library implementation

"Some people, when confronted with a problem, think “I know,
I'll use regular expressions.”  Now they have two problems." -- Jamie Zawinski

In [1]:
# the module is just called re
import re

### A quick example: detect all opening HTML tags

In [7]:
p = "<[^/].*?>"
s = "<a href=test.com><font size=1>Some text></font></a>"
re.findall(p, s)

['<a href=test.com>', '<font size=1>']

What we see: the **pattern** `p` is a string in a "domain-specific language" (a small language with its own, specialised syntax).

`s` is the target string to be matched.

### Ways of using REs

* `re.match(p, s)`: check whether pattern `p` matches part of string `s` **from the start**
* `re.search(p, s)`: check whether `p` matches any part of `s`
* `re.findall(p, s)`: find all matches of `p` in `s`
* `re.split(p, s)`: like `str.split` but splits wherever `p` matches part of `s`
* `re.sub(p, r, s)`: replace every match of `p` in `s` by `r` (`r` can be a string or a function).
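
The five functions above can be compared side by side on one string. The target string and pattern below are made up for illustration; the pattern syntax (`\d+` for "one or more digits") is explained later in the notebook.

```python
import re

s = "x=1, y=22, z=333"   # illustrative target string
p = r"\d+"               # one or more digits

print(re.match(p, s))             # None: s does not start with a digit
print(re.search(p, s).group(0))   # '1': the first match anywhere in s
print(re.findall(p, s))           # ['1', '22', '333']
print(re.split(r", ", s))         # ['x=1', 'y=22', 'z=333']
print(re.sub(p, "N", s))          # 'x=N, y=N, z=N'

# The replacement can also be a function that receives each Match object.
print(re.sub(p, lambda m: str(int(m.group(0)) * 2), s))  # 'x=2, y=44, z=666'
```

Note that `re.match` and `re.search` return a `Match` object (or `None`), whereas `re.findall` and `re.split` return lists of strings and `re.sub` returns a new string.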

### Raw strings

We previously saw f-strings `f""` which are useful for formatted output: 
```python
f"a = {a}"
```

A **raw string** uses a similar syntax `r""`. Inside a raw string, backslash-escapes such as `\n` and `\t` are not processed: they remain as-is.

In [10]:
print("not raw", 'a\tb\tc\nd\\e')
print("raw",    r'a\tb\tc\nd\\e')

not raw a	b	c
d\e
raw a\tb\tc\nd\\e


This is handy when writing RE patterns, because they often contain backslashes anyway. In a normal string, to write a backslash, we'd need to write a double-backslash. It would get confusing.
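
To make the point concrete, here is a small sketch of matching a literal backslash. The file path is a made-up example. The RE engine needs to see two backslashes to match one literal backslash, so a normal string needs four.

```python
import re

s = r"C:\temp\notes.txt"   # a string containing literal backslashes

p_normal = "\\\\temp"      # normal string: every backslash typed twice
p_raw = r"\\temp"          # raw string: what we type is what the RE engine sees

print(p_normal == p_raw)               # True: the two patterns are identical
print(re.search(p_raw, s).group(0))    # prints \temp
```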

### RE syntax

* Most characters just match themselves
* Wildcard `.` matches any one character
* Repetition markers `?`, `*`, `+`, `{n}`
* Match one of several alternatives marked with `|`
* Abbreviations like `\d`, `\s`, `\w` for character sets
* Custom sets like `[0-9a-f]` for hexadecimal
* Non-greedy marker `?`
* Backslash-escapes for special chars, e.g. `\.`
* Match beginning `^` or end `$`

(In the video presentation of this notebook, we will now go to https://regex101.com/, an amazing resource for learning, writing, and debugging REs.)

### Literals, wildcards, and repetition

In [49]:
s = "ax abx abbx abbbx abbbbx"

In [56]:
p = r"ab"
re.findall(p, s)

['ab', 'ab', 'ab', 'ab']

In [57]:
p = r"ab?x"
re.findall(p, s)

['ax', 'abx']

In [49]:
s = "ax abx abbx abbbx abbbbx"

In [59]:
p = r"ab*x"
re.findall(p, s)

['ax', 'abx', 'abbx', 'abbbx', 'abbbbx']

In [58]:
p = r"ab+x"
re.findall(p, s)

['abx', 'abbx', 'abbbx', 'abbbbx']

In [49]:
s = "ax abx abbx abbbx abbbbx"

In [60]:
p = r"a.x"
re.findall(p, s)

['abx']

In [61]:
p = r"a.*x"
re.findall(p, s)

['ax abx abbx abbbx abbbbx']

In [49]:
s = "ax abx abbx abbbx abbbbx"

In [63]:
p = r"ab{4}x"
re.findall(p, s)

['abbbbx']
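
The `{n}` marker has some related forms which are not listed above but which the `re` module also supports: `{m,n}` for "between m and n repetitions", `{m,}` for "m or more", and `{,n}` for "at most n". A brief sketch on the same target string:

```python
import re

s = "ax abx abbx abbbx abbbbx"

print(re.findall(r"ab{2,3}x", s))  # ['abbx', 'abbbx']
print(re.findall(r"ab{2,}x", s))   # ['abbx', 'abbbx', 'abbbbx']
print(re.findall(r"ab{,2}x", s))   # ['ax', 'abx', 'abbx']
```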

### Custom classes

It is common to want to match any character from a particular set. We do that using `[]`.

In [4]:
p = r"[-=+/*]+"
s = "x += 5 + 3 / 17"
re.findall(p, s)

['+=', '+', '/']

The syntax `a-z` or similar inside the `[]` means to take `a`, `z`, and all characters between. It works for `A-Z` and `0-9` also. 

(In the previous example, we saw `-` inside `[]` but not inside a range like `a-z`.)

In [66]:
p = r"[a-zA-Z]+"
s = "9080true6030false"
re.findall(p, s)

['true', 'false']

The character `^` as the **first** character inside `[]` means **not**, i.e. we now match anything **other than** what is inside `[]`.

In [112]:
p = "<[^/].*?>" # reject closing tags </
s = "<a href=test.com><font size=1>Some text></font></a>"
re.findall(p, s)

['<a href=test.com>', '<font size=1>']

### Example: hexadecimal

As we know, a hexadecimal digit is a digit 0-9 or letter a-f (or A-F). Here is an RE to match 1 or more hexadecimal digits.

In [77]:
p = r"[0-9a-fA-F]+"
s = "ACDC1979, Bach1741, ABBA1980"
re.findall(p, s)

['ACDC1979', 'Bac', '1741', 'ABBA1980']

### Abbreviations for common classes
* `\d`: any digit, i.e. `[0-9]`;
* `\D`: any nondigit, i.e. `[^0-9]`;
* `\s`: any whitespace;
* `\S`: any non-whitespace;
* `\w`: "alphanumeric" (including `_`) suitable for use in variable names;
* `\W`: non-"alphanumeric".
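
A quick way to get a feel for these abbreviations is to run each of them over the same string. The string below is invented for illustration.

```python
import re

s = "Room 101, floor_2: open 9am-5pm"

print(re.findall(r"\d+", s))   # ['101', '2', '9', '5']
print(re.findall(r"\w+", s))   # ['Room', '101', 'floor_2', 'open', '9am', '5pm']
print(re.findall(r"\s", s))    # four single spaces
print(re.findall(r"\W+", s))   # [' ', ', ', ': ', ' ', '-']
```

Notice that `floor_2` is a single `\w+` match, because `_` counts as a word character.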

**Exercise**: Strong Password Detection. Given a password as a string, say how strong it is. Give it "+20% strength" for each:
* Longer than 8 characters
* Contains lower-case
* Contains upper-case
* Contains one or more digits
* Contains one or more non-alphanumerical characters


In [114]:
import re
def strong_pw(s):
    strength = 0
    if len(s) > 8:
        strength += 20
    if re.search(r"\d", s) is not None:
        strength += 20
    if re.search(r"[a-z]", s) is not None:
        strength += 20
    if re.search(r"[A-Z]", s) is not None:
        strength += 20
    if re.search(r"\W", s) is not None or "_" in s:
        strength += 20
    return strength
strong_pw("MyTe$tPassw0rd")

100

### Anchors `^` and `$`
These characters "anchor" the match to the start and the end of the string respectively (or of each line, if the `re.MULTILINE` flag is used).

In [91]:
s = "a b c"
p = r"b"
re.findall(p, s)

['b']

In [92]:
s = "a b c"
p = r"^b"
re.findall(p, s)

[]
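
By default the anchors refer to the whole string. The sketch below, using a made-up three-line string, shows how the `re.MULTILINE` flag changes them to work line by line.

```python
import re

s = "b is first\na b c\nb again"

print(re.findall(r"^b", s))                 # ['b']: start of the whole string only
print(re.findall(r"^b", s, re.MULTILINE))   # ['b', 'b']: start of lines 1 and 3
print(re.findall(r"c$", s))                 # []: the string does not end in 'c'
print(re.findall(r"c$", s, re.MULTILINE))   # ['c']: end of line 2
```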

### Example: valid variable names

Which of these are valid variable names in Python?

In [83]:
ss = ["x", "x17", "n_chars", "n chars", "_", "_iters", "17x"]

The first character must be a letter or underscore, and after that letters, digits and underscores are allowed.

In [97]:
ss = ["x", "x17", "n_chars", "n chars", "_", "_iters", "17x"]
p = r"^[a-zA-Z_]\w*$"
for s in ss:
    print(repr(s), ":", re.match(p, s))

'x' : <re.Match object; span=(0, 1), match='x'>
'x17' : <re.Match object; span=(0, 3), match='x17'>
'n_chars' : <re.Match object; span=(0, 7), match='n_chars'>
'n chars' : None
'_' : <re.Match object; span=(0, 1), match='_'>
'_iters' : <re.Match object; span=(0, 6), match='_iters'>
'17x' : None


To correctly reject `"n chars"` which contains a space, we must avoid matching *part* of the string, so we use anchors to insist we match all of the string.
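
An alternative to writing the anchors by hand is `re.fullmatch`, which succeeds only if the entire string matches. The comparison below is a sketch using the same candidate names, with the anchors removed from the pattern to show the difference.

```python
import re

ss = ["x", "x17", "n_chars", "n chars", "_", "_iters", "17x"]
p = r"[a-zA-Z_]\w*"   # same pattern as above, but without the anchors

for s in ss:
    partial = re.match(p, s) is not None      # accepts a matching prefix
    whole = re.fullmatch(p, s) is not None    # requires the entire string to match
    print(repr(s), partial, whole)

# 'n chars' gives True for re.match (the prefix 'n' matches)
# but False for re.fullmatch.
```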

### Backslash-escapes

As we have seen, in the RE domain-specific language, many "metacharacters" have special meanings, e.g. `. ^ $ * + ? { } [ ] \ | ( )`.

Just like in Python strings, if we want to write a special character we have to escape it with `\`. 

Below is a simple, incomplete email address pattern. It matches one or more word characters (`\w+`), then `@`, then one or more word characters, then `\.ie`, which matches the literal `.ie`.

In [100]:
s = "james@nuigalway.ie"
p = r"\w+@\w+\.ie"
re.match(p, s)

<re.Match object; span=(0, 18), match='james@nuigalway.ie'>
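
The difference between an escaped and an unescaped `.` is easy to see on a string that contains a near-miss. The example string is invented. The standard-library function `re.escape` is also shown: it adds the backslashes for us when we want to match some text literally.

```python
import re

s = "Total: 3.50 (approx.) vs 3x50"

print(re.findall(r"3.50", s))    # ['3.50', '3x50']: unescaped '.' is a wildcard
print(re.findall(r"3\.50", s))   # ['3.50']: escaped '.' is a literal dot

# re.escape builds a pattern that matches its argument literally.
literal = "(approx.)"
p = re.escape(literal)
print(re.search(p, s).group(0))  # (approx.)
```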

### Groups

Remember, sometimes we want not just to validate/match the string, but to extract some structure from it. We can use **groups** for this, with the syntax `()`.

In the email address, let's extract the username, i.e. the part before `@`.

In [103]:
s = "james@nuigalway.ie"
p = r"(\w+)@\w+\.ie"
re.match(p, s).group(1)

'james'

Group 0 is the whole match. Group 1 is the first group. 

Groups can even be nested!
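
Here is a sketch of nested groups, extending the email pattern above (the extra parentheses are added here for illustration). Groups are numbered by the position of their opening parenthesis, counting from the left.

```python
import re

s = "james@nuigalway.ie"
# Group 1 is everything before ".ie"; groups 2 and 3 are nested inside it.
p = r"((\w+)@(\w+))\.ie"
m = re.match(p, s)

print(m.group(0))              # james@nuigalway.ie
print(m.group(1))              # james@nuigalway
print(m.group(2), m.group(3))  # james nuigalway
print(m.groups())              # ('james@nuigalway', 'james', 'nuigalway')
```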

Groups are also useful when we want to apply a repetition operator to a sub-pattern rather than a single character.

In [121]:
s = "abccc"
p = r"abc{3}"
re.match(p, s)

<re.Match object; span=(0, 5), match='abccc'>

In [122]:
s = "abcabcabc"
p = r"(abc){3}"
re.match(p, s)

<re.Match object; span=(0, 9), match='abcabcabc'>

However, beware -- `re.findall` changes behaviour if it sees any group syntax `()`, and will return the groups it finds, instead of the full matches.
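
The sketch below shows this behaviour and two possible ways around it. Both are standard features of the `re` module that have not been introduced above: the non-capturing group syntax `(?:...)`, which groups without capturing, and `re.finditer`, which yields `Match` objects.

```python
import re

s = "abcabcabc xyz abcabcabc"

# With a capturing group, findall returns the group's contents, not the full match.
print(re.findall(r"(abc){3}", s))     # ['abc', 'abc']

# A non-capturing group restores the full matches.
print(re.findall(r"(?:abc){3}", s))   # ['abcabcabc', 'abcabcabc']

# Alternatively, keep the group and ask each Match object for group 0.
print([m.group(0) for m in re.finditer(r"(abc){3}", s)])
```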

### Alternatives with `|`

The "Or" symbol `|` allows a match of either its left- or right-hand side. If you want to match something, and then one of two alternatives, put the alternatives inside a group. (Otherwise the "something" becomes part of the left-hand side.)

In [123]:
ss = ["10a", "10b", "11a", "11b", "11c"]
p = r"\d+(a|b)"
for s in ss:
    print(repr(s), re.match(p, s))

'10a' <re.Match object; span=(0, 3), match='10a'>
'10b' <re.Match object; span=(0, 3), match='10b'>
'11a' <re.Match object; span=(0, 3), match='11a'>
'11b' <re.Match object; span=(0, 3), match='11b'>
'11c' None
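
To see why the group matters, compare the grouped pattern with an ungrouped version. In this sketch `re.fullmatch` is used so that only whole-string matches count, and the test strings are chosen for illustration.

```python
import re

ss = ["10a", "10b", "b", "10c"]
grouped = r"\d+(a|b)"     # digits, then either 'a' or 'b'
ungrouped = r"\d+a|b"     # either (digits then 'a'), or just 'b'

for s in ss:
    print(repr(s),
          bool(re.fullmatch(grouped, s)),
          bool(re.fullmatch(ungrouped, s)))

# '10b' is accepted only by the grouped pattern;
# 'b' on its own is accepted only by the ungrouped pattern.
```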


### Example: MAC address

As we know, a MAC address or hardware address is a string like `00:A0:C9:14:C8:29` -- 6 pairs of hexadecimal digits. Write an RE to match MAC addresses.


In [19]:
p = r"([0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2}"
s = "00:A0:C9:14:C8:29"
re.match(p, s)

<re.Match object; span=(0, 17), match='00:A0:C9:14:C8:29'>

### Greedy expressions


As we know, the `?` character means "zero or one repetition of the preceding item", e.g. `ab?` matches `a` and matches `ab`.

The `?` character can be used as a **non-greedy** modifier to `*` or to `+`. It tells them: instead of matching **as much as possible**, match **as little as possible**. 

(The following example is from [Stack Overflow](https://stackoverflow.com/questions/2301285/what-do-lazy-and-greedy-mean-in-the-context-of-regular-expressions).)

In [22]:
s = "<em>Hello World</em>"
p = r"<.+>"
re.findall(p, s)

['<em>Hello World</em>']

In [124]:
s = "<em>Hello World</em>"
p = r"<.+?>"
re.findall(p, s)

['<em>', '</em>']
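
A few more small cases may help to show what "as much as possible" and "as little as possible" mean. The last line is a common alternative to the non-greedy marker, offered here as a sketch: a negated class `[^>]` simply cannot run past the first `>`.

```python
import re

print(re.match(r"a+", "aaa").group(0))    # 'aaa': greedy, takes all it can
print(re.match(r"a+?", "aaa").group(0))   # 'a': non-greedy, takes the minimum of one
print(re.match(r"a*?", "aaa").group(0))   # '': the minimum for * is zero

s = "<em>Hello World</em>"
print(re.findall(r"<[^>]+>", s))          # ['<em>', '</em>']
```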

### Ways of using REs

* `re.match(p, s)`: check whether pattern `p` matches part of string `s` **from the start**
* `re.search(p, s)`: check whether `p` matches any part of `s`
* `re.findall(p, s)`: find all matches of `p` in `s`
* `re.split(p, s)`: like `str.split` but splits wherever `p` matches part of `s`
* `re.sub(p, r, s)`: replace every match of `p` in `s` by `r` (`r` can be a string or a function).

### Example: tokenizing text

Write a pattern which matches a single word, and thus use `re.findall` to tokenize this text, discarding punctuation.

In [108]:
s = "It was the best of times, it was the worst of times."
p = r"\w+"
re.findall(p, s)

['It',
 'was',
 'the',
 'best',
 'of',
 'times',
 'it',
 'was',
 'the',
 'worst',
 'of',
 'times']

Do the same thing the other way round: write a pattern that matches any non-word stuff, and use `re.split`.

In [110]:
p = r"[^\w]+"
re.split(p, s)

['It',
 'was',
 'the',
 'best',
 'of',
 'times',
 'it',
 'was',
 'the',
 'worst',
 'of',
 'times',
 '']
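
Notice the empty string at the end of the `re.split` result: the final `.` is a separator with nothing after it. One possible way to tidy this up, sketched below, is to filter out empty strings. The pattern `\W+` used here is equivalent to `[^\w]+`.

```python
import re

s = "It was the best of times, it was the worst of times."

parts = re.split(r"\W+", s)        # \W is the same class as [^\w]
print(parts[-1] == "")             # True: nothing follows the final '.'

tokens = [t for t in parts if t]   # drop empty strings
print(tokens == re.findall(r"\w+", s))  # True: both approaches now agree
```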

Further reading/reference: 

* Python RE HOWTO https://docs.python.org/3/howto/regex.html
* Vanderplas, pp 76-83
* Python `re` module docs https://docs.python.org/3/library/re.html
* https://blog.codinghorror.com/regular-expressions-now-you-have-two-problems/
* Automate the Boring Stuff, ch 7 https://automatetheboringstuff.com/chapter7/
