## Lesson 09 - Regular Expressions

Resources:

* [Regular Expression HOWTO (for Python)](https://docs.python.org/3/howto/regex.html)
* [Drew's Grep Tutorial](http://www.uccs.edu/~ahitchco/grep/), [Linux Grep Tutorial](http://ryanstutorials.net/linuxtutorial/grep.php)
* [Perl 5 Regex Cheat Sheet](https://perlmaven.com/regex-cheat-sheet)

Today we will cover:

* What are regular expressions?
* Regular expression syntax
* `sed` examples (use `sed -E`)
* `grep` examples (use `egrep` or `grep -E`)
* Python examples: `re` module (note there are also built-in methods of objects)

### What are regular expressions?

Regular expressions are used to match patterns in strings. We build a regular expression so that it matches exactly the set of strings we want *and nothing more*.

Here are some examples of regular expressions:

`.ob` -- matches `Bob`, `sob`, `8ob`, `!ob`, ....<br>
`Lo*l` -- matches `Ll`, `Lol`, `Looooool`, ....<br>
`Lo?l` -- matches `Ll` and `Lol`.<br>
`A.*e` -- matches `Ae`, `Abe`, `Arrrrrre`, ....<br> 
`[BFG]ad` -- matches `Bad`,  `Fad`, and `Gad`.<br>
`[0-9]*\.[0-9]*` -- matches `3.14159`, `1000.0`, ....<br>
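
One way to check examples like these is Python's `re.fullmatch`, which succeeds only when the whole string matches the pattern. The following is a minimal sketch; the names `examples`, `results`, and `lone_dot` are illustrative only.

```python
import re

# Each pattern from the list above, paired with strings it should match.
examples = {
    r'.ob': ['Bob', 'sob', '8ob', '!ob'],
    r'Lo*l': ['Ll', 'Lol', 'Looooool'],
    r'Lo?l': ['Ll', 'Lol'],
    r'A.*e': ['Ae', 'Abe', 'Arrrrrre'],
    r'[BFG]ad': ['Bad', 'Fad', 'Gad'],
    r'[0-9]*\.[0-9]*': ['3.14159', '1000.0'],
}

# re.fullmatch requires the entire string to match the pattern.
results = {
    pattern: [bool(re.fullmatch(pattern, s)) for s in strings]
    for pattern, strings in examples.items()
}
print(results)

# Lo?l allows at most one 'o', so 'Lool' is rejected.
print(re.fullmatch(r'Lo?l', 'Lool'))  # None

# Both digit runs may be empty, so a lone '.' also matches the last pattern.
lone_dot = bool(re.fullmatch(r'[0-9]*\.[0-9]*', '.'))
print(lone_dot)  # True
```

The last check illustrates the "and nothing more" point: a pattern can accept strings we did not intend.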


### Regular expression syntax

#### Basic character classes

`a` -- A single character (here 'a').<br>
`.` -- Any character except newline.

#### Alternation

`a|b` -- Match first part or second part (here 'a' or 'b').

#### Character classes

`[bgh.]` -- One of the characters listed in the character class: here b, g, h, or a literal `.` (a dot is not special inside brackets).<br>
`[b-h]` -- Same as [bcdefgh].<br>
`[a-z]` -- Lower case Latin letters.<br>
`[bc-]` -- The characters b, c or - (dash).<br>
`[^bx]` -- Negated character class. Anything except b or x.
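
As a rough illustration, the character classes above can be tried in Python with `re.findall`. The sample string `text` and the variable names are made up for this sketch.

```python
import re

text = 'big bug - hex.'

listed = re.findall(r'[bgh.]', text)           # b, g, h, or a literal dot
ranged = re.findall(r'[b-h]', text)            # any letter from b through h
with_dash = re.findall(r'[bc-]', text)         # b, c, or a literal dash
negated = ''.join(re.findall(r'[^bx]', text))  # everything except b and x

print(listed)     # ['b', 'g', 'b', 'g', 'h', '.']
print(ranged)     # ['b', 'g', 'b', 'g', 'h', 'e']
print(with_dash)  # ['b', 'b', '-']
print(negated)    # ig ug - he.
```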

#### Predefined character class abbreviations

Construct        | Equivalent class | Negated construct   | Equivalent negated class
-----------------|------------------|---------------------|-------------------------
`\d` (a digit)   | `[0-9]`          | `\D` (digits, not!) | `[^0-9]`
`\w` (word char) | `[a-zA-Z0-9_]`   | `\W` (words, not!)  | `[^a-zA-Z0-9_]`
`\s` (space char)| `[ \r\t\n\f]`    | `\S` (space, not!)  | `[^ \r\t\n\f]`
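
A short sketch of these abbreviations in Python follows; the sample string is invented for illustration. One caveat worth knowing: for ordinary (Unicode) strings, Python's `\d`, `\w`, and `\s` also match non-ASCII digits, letters, and whitespace, and the `re.ASCII` flag restricts them to the classes shown in the table.

```python
import re

sample = 'Room 42,\tfloor_3'

digits = re.findall(r'\d', sample)     # single digits
words = re.findall(r'\w+', sample)     # runs of word characters
spaces = re.findall(r'\s', sample)     # whitespace characters
non_word = re.findall(r'\W', sample)   # anything that is not a word character

print(digits)    # ['4', '2', '3']
print(words)     # ['Room', '42', 'floor_3']
print(spaces)    # [' ', '\t']
print(non_word)  # [' ', ',', '\t']

# U+0663 is ARABIC-INDIC DIGIT THREE: matched by \d unless re.ASCII is set.
unicode_digit = re.findall(r'\d', '\u0663')
ascii_only = re.findall(r'\d', '\u0663', re.ASCII)
print(len(unicode_digit), len(ascii_only))  # 1 0
```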

#### Multipliers

Multiplier symbol | Bracket notation | Number of instances matched
:----------------:|:----------------:|:-----------------:
`*`               | `{0,}`           | 0, 1, 2, ...     
`+`               | `{1,}`           | 1, 2, 3, ...      
`?`               | `{0,1}`          | 0 or 1            
                  | `{2}`            | 2                 
                  | `{2,}`           | 2, 3, 4, ...      
                  | `{1,3}`          | 1, 2 or 3        
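
The equivalence between the multiplier symbols and bracket notation can be checked with a small Python sketch. The test strings and variable names here are illustrative assumptions, not part of the lesson.

```python
import re

strings = ['Ll', 'Lol', 'Lool', 'Loool']

# Each symbol should behave exactly like its bracket form.
pairs = [(r'Lo*l', r'Lo{0,}l'), (r'Lo+l', r'Lo{1,}l'), (r'Lo?l', r'Lo{0,1}l')]
equivalent = []
for symbol, bracket in pairs:
    a = [bool(re.fullmatch(symbol, s)) for s in strings]
    b = [bool(re.fullmatch(bracket, s)) for s in strings]
    equivalent.append(a == b)
    print(symbol, a)

# Bracket-only forms: exactly two, and between one and three.
exactly_two = [s for s in strings if re.fullmatch(r'Lo{2}l', s)]
one_to_three = [s for s in strings if re.fullmatch(r'Lo{1,3}l', s)]
print(exactly_two)   # ['Lool']
print(one_to_three)  # ['Lol', 'Lool', 'Loool']
```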

#### Grouping and capturing

`(...)` -- Grouping and capture matching substring.<br>
`\1`, `\2` -- Use substring in replacement (sed).<br>
`$1`, `$2` -- Use substring in replacement (perl).
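
Python's `re` module uses the same `\1`, `\2` notation as `sed` in replacement strings. A minimal sketch, with an invented sample string:

```python
import re

name = 'Lovelace, Ada'

# Two capture groups: the word before the comma and the word after it.
m = re.search(r'(\w+), (\w+)', name)
print(m.group(1), m.group(2))  # Lovelace Ada

# \2 and \1 in the replacement refer back to the captured substrings.
swapped = re.sub(r'(\w+), (\w+)', r'\2 \1', name)
print(swapped)  # Ada Lovelace
```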

#### Anchors

`^` -- Beginning of string.<br>
`$` -- End of string.
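
A brief Python sketch of how anchors constrain a match; the sample string is illustrative.

```python
import re

line = 'CcCcCc'

starts = bool(re.search(r'^Cc', line))        # 'Cc' at the beginning
ends = bool(re.search(r'Cc$', line))          # 'Cc' at the end
whole = bool(re.search(r'^Cc$', line))        # the string is exactly 'Cc'
repeated = bool(re.search(r'^(Cc)+$', line))  # one or more 'Cc' and nothing else

print(starts, ends, whole, repeated)  # True True False True
```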

### `sed` examples (use `sed -E`)

`sed` stands for 'stream editor'. It accepts standard input, modifies it, and prints to standard output.

The most common `sed` command is substitution, written `s/find/replace/g`, where the optional `g` flag means replace every match on a line rather than only the first.

In [1]:
%%bash
echo "AaBbBbCcCcCcDdDdDdDd
Aa
BbBb
CcCcCc
DdDdDdDd" > test_regex.txt

In [2]:
cat test_regex.txt

AaBbBbCcCcCcDdDdDdDd
Aa
BbBb
CcCcCc
DdDdDdDd


In [3]:
%%bash
sed 's/Cc//g' test_regex.txt

AaBbBbDdDdDdDd
Aa
BbBb

DdDdDdDd


In [4]:
%%bash
sed 's/Dd/Zz/g' test_regex.txt

AaBbBbCcCcCcZzZzZzZz
Aa
BbBb
CcCcCc
ZzZzZzZz


In [5]:
%%bash
sed 's/.*(Dd)*/XXX/g' test_regex.txt

AaBbBbCcCcCcDdDdDdDd
Aa
BbBb
CcCcCc
DdDdDdDd


In [6]:
%%bash
sed 's/.*\(Dd\)*/XXX/g' test_regex.txt

XXX
XXX
XXX
XXX
XXX


In [7]:
%%bash
sed -E 's/.*(Dd)*/XXX/g' test_regex.txt

XXX
XXX
XXX
XXX
XXX


In [8]:
%%bash
sed -E 's/.*(Dd)+/XXX/g' test_regex.txt

XXX
Aa
BbBb
CcCcCc
XXX
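
For comparison, some of the substitutions above can be approximated in Python with `re.sub`, applied line by line. This is only a sketch: the list `lines` stands in for the contents of `test_regex.txt`, and Python's regex engine is not identical to `sed`'s in every detail.

```python
import re

lines = ['AaBbBbCcCcCcDdDdDdDd', 'Aa', 'BbBb', 'CcCcCc', 'DdDdDdDd']

# Roughly equivalent to: sed 's/Cc//g'
no_cc = [re.sub(r'Cc', '', line) for line in lines]
print(no_cc)  # ['AaBbBbDdDdDdDd', 'Aa', 'BbBb', '', 'DdDdDdDd']

# Roughly equivalent to: sed -E 's/.*(Dd)+/XXX/g'
replaced = [re.sub(r'.*(Dd)+', 'XXX', line) for line in lines]
print(replaced)  # ['XXX', 'Aa', 'BbBb', 'CcCcCc', 'XXX']
```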


### `grep` examples (use `egrep` or `grep -E`)

`grep` takes its name from the `ed` editor command `g/re/p`: globally search for a regular expression and print the matching lines.

In [9]:
%%bash
cat test_regex.txt

AaBbBbCcCcCcDdDdDdDd
Aa
BbBb
CcCcCc
DdDdDdDd


In [10]:
%%bash
egrep "Aa" test_regex.txt

AaBbBbCcCcCcDdDdDdDd
Aa


In [11]:
%%bash
egrep "^Cc" test_regex.txt

CcCcCc


In [12]:
%%bash
egrep "Cc{3}" test_regex.txt

In [13]:
%%bash
egrep "(Cc){3}" test_regex.txt

AaBbBbCcCcCcDdDdDdDd
CcCcCc


In [14]:
%%bash
egrep "([A-Z][a-z]){3}" test_regex.txt

AaBbBbCcCcCcDdDdDdDd
CcCcCc
DdDdDdDd
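
The same filtering can be sketched in Python: keep each line for which `re.search` finds a match. The helper `grep` below is a hypothetical function written for this illustration, and `lines` stands in for the contents of `test_regex.txt`.

```python
import re

lines = ['AaBbBbCcCcCcDdDdDdDd', 'Aa', 'BbBb', 'CcCcCc', 'DdDdDdDd']

def grep(pattern, lines):
    """Return the lines containing a match for pattern (hypothetical helper)."""
    return [line for line in lines if re.search(pattern, line)]

anchored = grep(r'^Cc', lines)
ungrouped = grep(r'Cc{3}', lines)      # {3} applies only to 'c', so this needs 'Cccc'
grouped = grep(r'(Cc){3}', lines)      # {3} applies to the whole group 'Cc'
pairs = grep(r'([A-Z][a-z]){3}', lines)

print(anchored)   # ['CcCcCc']
print(ungrouped)  # []
print(grouped)    # ['AaBbBbCcCcCcDdDdDdDd', 'CcCcCc']
print(pairs)      # ['AaBbBbCcCcCcDdDdDdDd', 'CcCcCc', 'DdDdDdDd']
```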


### Regular expressions in Python

Adapted from [TutorialsPoint](http://www.tutorialspoint.com/python/python_reg_expressions.htm).

A *regular expression* is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern. Regular expressions are widely used in the UNIX world.

The module `re` provides full support for Perl-like regular expressions in Python. The `re` module raises the exception `re.error` if an error occurs while compiling or using a regular expression.

We use three important functions -- `match`, `search` and `sub` -- to handle regular expressions. But a small thing first: backslashes have a special meaning in ordinary Python string literals as well as in regular expressions. To avoid any confusion, we write patterns as raw strings, as in `r'expression'`.

To use regular expressions in Python, use the `re` module:

In [15]:
import re

#### The `search` function

This function searches for the first occurrence of the RE *pattern* within *string*, with optional *flags*.

Here is the syntax for this function:

	re.search(pattern, string, flags=0)
    
Here is the description of the parameters:

| Parameter | Description |
|:----------|:------------|
| pattern   | This is the regular expression to be matched.|
| string    | This is the string to be searched; `search` looks for the pattern anywhere in it, not just at the beginning.|
| flags     | Several modifiers are available, but the most common one is `re.I` for case-insensitive search.|

Both `re.search` and `re.match` return a match object on success and `None` on failure. We use the `group(num)` or `groups()` method of the match object to get the matched text:

| Match Object Method | Description |
|:--------------------|:------------|
| group(num=0)        | This method returns entire match (or specific subgroup num) |
| groups()            | This method returns all matching subgroups in a tuple (empty if there weren't any) |

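Note: There is a similar function `match` that matches only if the pattern is found at the very beginning of the string. In most cases we can instead use `search` and begin our regular expression with a `^`. The two differ only when the `re.MULTILINE` flag is set, because `^` then also matches at the start of each line.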
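
The relationship between `match` and `search` can be checked directly. The sketch below uses an invented two-line string and also shows one caveat: with `re.MULTILINE`, `^` matches after each newline, while `match` still only looks at the very start of the string.

```python
import re

text = 'first line\nsecond line'

at_start = re.match(r'first', text)                    # matches at position 0
plain = re.search(r'^second', text)                    # None: ^ means start of string
multi = re.search(r'^second', text, re.MULTILINE)      # matches after the newline
match_multi = re.match(r'second', text, re.MULTILINE)  # None: match needs position 0

print(at_start is not None)  # True
print(plain)                 # None
print(multi.start())         # 11
print(match_multi)           # None
```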

In [16]:
# Example 1

x = 'Begin With Review And Friend'

In [17]:
re.search(r'W[a-z]*', x)

<_sre.SRE_Match object; span=(6, 10), match='With'>

In [18]:
re.search(r'W[a-z]*', x).group()

'With'

In [19]:
y = re.search(r'W[a-z]*', x)

In [20]:
y.group()

'With'

In [21]:
re.match(r'W[a-z]*', x)

In [22]:
re.match(r'Begin', x)

<_sre.SRE_Match object; span=(0, 5), match='Begin'>

In [23]:
re.search(r'^Begin', x)

<_sre.SRE_Match object; span=(0, 5), match='Begin'>

In [24]:
re.search(r'^With', x)

In [25]:
# Example 2

line = "3.8 Liters in 1 Gallon"

searchObj = re.search(r'([.0-9]*) liters in ([.0-9]*) gallon', line, re.I)

In [26]:
searchObj.groups()

('3.8', '1')

In [27]:
searchObj.group()

'3.8 Liters in 1 Gallon'

In [28]:
if searchObj:
   print("searchObj.group() : ", searchObj.group())
   print("searchObj.group(1) : ", searchObj.group(1))
   print("searchObj.group(2) : ", searchObj.group(2))
else:
   print("Nothing found!!")

searchObj.group() :  3.8 Liters in 1 Gallon
searchObj.group(1) :  3.8
searchObj.group(2) :  1


#### The `sub` function

One of the most important functions in the `re` module is `sub`, which lets you search and replace.

Syntax:

	re.sub(pattern, repl, string, count=0, flags=0)

This method replaces occurrences of the RE *pattern* in *string* with *repl*. All occurrences are replaced unless *count* is given, in which case at most *count* replacements are made. It returns the modified string.
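
In the standard library, the limit on the number of replacements is passed as the `count` keyword of `re.sub`. A short sketch with an invented sample string:

```python
import re

text = 'Cc Cc Cc'

# With no limit, every occurrence is replaced.
all_replaced = re.sub(r'Cc', 'Zz', text)
print(all_replaced)  # Zz Zz Zz

# count=1 replaces only the first occurrence.
first_only = re.sub(r'Cc', 'Zz', text, count=1)
print(first_only)  # Zz Cc Cc
```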

In [29]:
# Example 3

phone = "2004-959-559 # This is Phone Number"

In [30]:
# delete Python-style comments
num = re.sub(r'#.*$', "", phone)
num

'2004-959-559 '

In [31]:
num.strip()

'2004-959-559'

In [32]:
# remove anything other than digits
num = re.sub(r'\D', "", phone)    
num

'2004959559'

#### The `compile` function

Regular expressions can be compiled into pattern objects, which have methods for various operations such as searching for pattern matches or performing string substitutions.

In [33]:
# Example 4

p = re.compile('ab*')
p

re.compile(r'ab*', re.UNICODE)

In [34]:
m = p.search('Drab')
m

<_sre.SRE_Match object; span=(2, 4), match='ab'>

In [35]:
m.group()

'ab'

In [36]:
m.span()

(2, 4)

In [37]:
m.start()

2

In [38]:
m.end()

4

In [39]:
# Example 5

q = re.compile('[a-z]+')
n = q.match('tempo')

In [40]:
q

re.compile(r'[a-z]+', re.UNICODE)

In [41]:
n

<_sre.SRE_Match object; span=(0, 5), match='tempo'>

In [42]:
n.group()

'tempo'

In [43]:
n.start(), n.end()

(0, 5)

#### The `findall` function

If you want to find all examples of a regular expression in a string, `findall` will do this for you and return the results as a list. The following example will find all the lowercase letters in a string:

In [44]:
# Example 6

regex = re.compile('[a-z]')
string = 'Here Is A String'
letters = regex.findall(string)
letters

['e', 'r', 'e', 's', 't', 'r', 'i', 'n', 'g']

In [45]:
regex = re.compile(r'[A-Z][a-z]*')
string = 'Here Is A String'
words = regex.findall(string)
words

['Here', 'Is', 'A', 'String']
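
One detail of `findall` that often surprises: when the pattern contains capture groups, the result holds the groups rather than the whole match. A minimal sketch, reusing the string from Example 2:

```python
import re

text = '3.8 Liters in 1 Gallon'

# No groups: each result is the full matched text.
numbers = re.findall(r'[.0-9]+', text)
print(numbers)  # ['3.8', '1']

# Two groups: each result is a tuple of the captured substrings.
quantities = re.findall(r'([.0-9]+) (\w+)', text)
print(quantities)  # [('3.8', 'Liters'), ('1', 'Gallon')]
```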

### Appendix: Python regular expression cheat sheet

#### Regular Expression Patterns

Except for the metacharacters `+ ? . * ^ $ ( ) [ ] { } | \`, all characters match themselves. You can match a metacharacter literally by preceding it with a backslash.
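
A short sketch of escaping in Python. The sample strings are invented; `re.escape` is the standard-library helper that adds the backslashes for you.

```python
import re

price = 'Total: $3.50 (approx.)'

# An unescaped dot matches any character, so '3x50' is accepted.
loose = bool(re.search(r'3.50', '3x50'))
# An escaped dot matches only a literal '.'.
strict = bool(re.search(r'3\.50', '3x50'))
print(loose, strict)  # True False

# re.escape backslash-escapes the special characters in a literal string.
escaped = re.escape('$3.50')
found = re.search(escaped, price).group()
print(found)  # $3.50
```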

The following table lists common regular expression syntax. Most of it is available in Python; exceptions are noted.

| Pattern | Description
|:----------|:------------|
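| ^ | Matches beginning of string (and of each line with `re.MULTILINE`).
| `$` | Matches end of string or just before a trailing newline (and at the end of each line with `re.MULTILINE`).
| . | Matches any single character except newline. The `re.DOTALL` flag allows it to match newline as well.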
| [...] | Matches any single character in brackets.
| [^...] | Matches any single character not in brackets
| re* | Matches 0 or more occurrences of preceding expression.
| re+ | Matches 1 or more occurrence of preceding expression.
| re? | Matches 0 or 1 occurrence of preceding expression.
| re{n} | Matches exactly n number of occurrences of preceding expression.
| re{n,} | Matches n or more occurrences of preceding expression.
| re{n,m} | Matches at least n and at most m occurrences of preceding expression (no space after the comma).
| `a` &#124; `b` | Matches either a or b.
| (re) | Groups regular expressions and remembers matched text.
| \w | Matches word characters.
| \W | Matches nonword characters.
| \s | Matches whitespace. Equivalent to [ \t\n\r\f\v] for ASCII text.
| \S | Matches nonwhitespace.
| \d | Matches digits. Equivalent to [0-9].
| \D | Matches nondigits.
| \A | Matches beginning of string.
| \Z | Matches only at the end of string.
| \z | Perl-style end-of-string anchor. Older Python versions reject it, so prefer \Z in Python.
| \G | Matches point where last match finished (Perl and PCRE; not supported by Python's `re`).
| \b | Matches word boundaries when outside brackets. Matches backspace (0x08) when inside brackets.
| \B | Matches nonword boundaries.
| \n, \t, etc. | Matches newlines, carriage returns, tabs, etc.
| \1...\9 | Matches nth grouped subexpression.
| \10 | Matches nth grouped subexpression if it matched already. Otherwise refers to the octal representation of a character code.

#### Regular Expression Examples

| Example | Description
|:----------|:------------|
| python | Match "python".
| [Pp]ython | Match "Python" or "python"
| rub[ye] | Match "ruby" or "rube"
| [aeiou] | Match any one lowercase vowel
| [0-9] | Match any digit; same as [0123456789]
| [a-z] | Match any lowercase ASCII letter
| [A-Z] | Match any uppercase ASCII letter
| [a-zA-Z0-9] | Match any of the above
| [^aeiou] | Match anything other than a lowercase vowel
| [^0-9] | Match anything other than a digit

#### Special Character Classes

| Example | Description
|:----------|:------------|
| . | Match any character except newline
| \d | Match a digit: [0-9]
| \D | Match a nondigit: [^0-9]
| \s | Match a whitespace character: [ \t\r\n\f]
| \S | Match nonwhitespace: [^ \t\r\n\f]
| \w | Match a single word character: [A-Za-z0-9_]
| \W | Match a nonword character: [^A-Za-z0-9_]

#### Repetition Cases

| Example | Description
|:----------|:------------|
| ruby? | Match "rub" or "ruby": the y is optional
| ruby* | Match "rub" plus 0 or more ys
| ruby+ | Match "rub" plus 1 or more ys
| \d{3} | Match exactly 3 digits
| \d{3,} | Match 3 or more digits
| \d{3,5} | Match 3, 4, or 5 digits

#### Nongreedy repetition

Adding `?` after a multiplier makes it nongreedy: it matches the smallest number of repetitions possible.

| Example | Description|
|:------|:-----------|
| `<.*>` | Greedy repetition: matches all of `<python>perl>`|
| `<.*?>` | Nongreedy: matches `<python>` in `<python>perl>`|
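
The difference is easy to see in Python, using the string from the table:

```python
import re

text = '<python>perl>'

# Greedy: .* grabs as much as possible, up to the last '>'.
greedy = re.search(r'<.*>', text).group()
# Nongreedy: .*? stops at the first '>' that allows a match.
lazy = re.search(r'<.*?>', text).group()

print(greedy)  # <python>perl>
print(lazy)    # <python>
```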

#### Grouping with Parentheses

| Example | Description|
|:----------|:------------|
| \D\d+ | No group: + repeats \d
| (\D\d)+ | Grouped: + repeats \D\d pair
| ([Pp]ython(, )?)+ | Match "Python", "Python, python, python", etc.

#### Backreferences

This matches a previously matched group again:

| Example | Description |
|:----------|:------------|
|([Pp])ython&\1ails | Match python&pails or Python&Pails |
|(['"]).*?\1 | Single- or double-quoted string. \1 matches whatever the 1st group matched; \2 would match whatever a 2nd group matched, etc. Backreferences do not work inside brackets. |
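
A sketch of backreferences in Python. The quoted-string pattern here uses a nongreedy `.*?` followed by `\1`, which is one common way to write it; the variable names and test strings are illustrative.

```python
import re

pattern = re.compile(r'([Pp])ython&\1ails')

lower = pattern.fullmatch('python&pails')   # \1 must be 'p'
upper = pattern.fullmatch('Python&Pails')   # \1 must be 'P'
mixed = pattern.fullmatch('Python&pails')   # 'P' then 'p': no match
print(lower is not None, upper is not None, mixed)  # True True None

# The closing quote must be the same character as the opening quote.
quoted = re.compile(r'''(['"]).*?\1''')
single = quoted.search("say 'hi' now").group()
double = quoted.search('"it\'s"').group()
print(single)  # 'hi'
print(double)  # "it's"
```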

#### Alternatives

| Example | Description
|:----------|:------------|
| python&#124;perl | Match "python" or "perl"
| rub(y&#124;le) | Match "ruby" or "ruble"
| Python(!+&#124;\?) | "Python" followed by one or more ! or one ?

#### Anchors

Anchors specify the position at which a match must occur.

| Example | Description
|:----------|:------------|
| ^Python | Match "Python" at the start of a string (or of any line with `re.MULTILINE`)
| Python$ | Match "Python" at the end of a string (or of any line with `re.MULTILINE`)
| \APython | Match "Python" at the start of a string
| Python\Z | Match "Python" at the end of a string
| \bPython\b | Match "Python" at a word boundary
| \brub\B | \B is nonword boundary: match "rub" in "rube" and "ruby" but not alone
| Python(?=!) | Match "Python", if followed by an exclamation point.
| Python(?!!) | Match "Python", if not followed by an exclamation point.
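
A few of these anchors, sketched in Python with invented test strings:

```python
import re

# \b requires a word boundary on each side, so 'Pythonic' is skipped.
words = re.findall(r'\bPython\b', 'Python Pythonic Python!')
print(words)  # ['Python', 'Python']

# \B requires that 'rub' is followed by another word character.
inside = re.search(r'\brub\B', 'ruby')
alone = re.search(r'\brub\B', 'rub')
print(inside is not None, alone)  # True None

# Lookahead checks what follows without including it in the match.
followed = re.search(r'Python(?=!)', 'Python!').group()
not_followed = re.search(r'Python(?!!)', 'Python!')
print(followed, not_followed)  # Python None

# ^ matches at internal line starts only with re.MULTILINE.
text = 'Perl\nPython'
default = re.findall(r'^Python', text)
per_line = re.findall(r'^Python', text, re.MULTILINE)
print(default, per_line)  # [] ['Python']
```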

#### Special Syntax with Parentheses

| Example | Description
|:----------|:------------|
| R(?#comment) | Matches "R". All the rest is a comment
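| R(?i)uby | Perl/Ruby style: case-insensitive from this point on. Python's `re` expects global flags such as (?i) at the start of the pattern
| R(?i:uby) | Case-insensitive while matching "uby" only (Python 3.6+)
| rub(?:y&#124;le) | Group only without creating \1 backreference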
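
A short Python sketch of these constructs. The scoped-flag form `(?i:...)` assumes Python 3.6 or later; variable names are illustrative.

```python
import re

# A non-capturing group groups without recording what it matched.
non_capturing = re.fullmatch(r'rub(?:y|le)', 'ruble')
capturing = re.fullmatch(r'rub(y|le)', 'ruble')
print(non_capturing.groups())  # ()
print(capturing.groups())      # ('le',)

# (?i:...) makes only the enclosed part case-insensitive.
scoped_ok = re.fullmatch(r'R(?i:uby)', 'RUBY')
scoped_fail = re.fullmatch(r'R(?i:uby)', 'rUBY')  # the leading R is still case-sensitive
print(scoped_ok is not None, scoped_fail)  # True None

# (?#...) is a comment and matches nothing.
commented = re.fullmatch(r'R(?#comment)uby', 'Ruby')
print(commented is not None)  # True
```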