# Regular Expressions

## Outline
- [Overview](#overview)
- [Regular Expression Matching](#regular-expression-matching)
- [Modifying Strings](#modifying-strings)

<a id="overview"></a>
## Overview

*Regular expressions* (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the [re](https://docs.python.org/3/library/re.html) module. 

The `re` module provides full support for [Perl](https://www.perl.org/)-like regular expressions in Python. 
Through the functions in the `re` module, you write a pattern, in a specialized syntax, that describes the set of strings you want to match, and then ask whether a given string matches it or contains a match somewhere. 
You can also use a pattern to modify a string or split it apart in various ways. 

The page [Regular Expression HOWTO](https://docs.python.org/3/howto/regex.html#regex-howto) provides an introductory tutorial to using regular expressions in Python with the `re` module. 

From here on, we will assume that the following import has taken place:

In [None]:
import re

<a id="regular-expression-matching"></a>
## Regular Expression Matching

The first thing to recognize when using regular expressions is that everything is essentially a character, and we are writing patterns to match a specific sequence of characters (also known as a string). Most patterns use plain ASCII, which includes letters, digits, punctuation and other keyboard symbols like %#$@!, but Unicode characters can also be used to match international text.

The general steps in using a regular expression to search for a pattern match are:

1. Create a pattern object with the `re.compile()` function, writing the pattern as a raw string whenever it contains backslashes
1. Pass the string you want to search to the pattern object's `search()` method
1. Call the returned `Match` object's `group()` method to get the actual matched text

For example, 

In [None]:
line = 'Cats are smarter than dogs'

dogRegex = re.compile('dog')
searchObj = dogRegex.search(line)
searchObj.group()

`re.compile()` also accepts an optional `flags` argument to enable various special features and syntax variations.
For example,

In [None]:
line = 'Dogs are smarter than cats'

dogRegex = re.compile('dog', re.IGNORECASE)
searchObj = dogRegex.search(line)
searchObj.group()

You can also call the `re.search()` function directly, without compiling the pattern first. 

The `re.search()` function returns `None` if no match can be found. 
If successful, it returns a `re.Match` object containing information about the match: where it starts and ends, the substring it matched, and more.

Try the following:

In [None]:
print(re.search('dog', 'Dogs are smarter than cats'))

In [None]:
m = re.search('dog', 'Cats are smarter than dogs')
m

In [None]:
m.group()

In [None]:
m.start(), m.end()

In [None]:
m.span()

### Metacharacters
In the above example, the regular expression `'dog'` contains only literal characters, so it matches the string 'dog' exactly. 
There are exceptions to this rule: some characters are special **metacharacters**, and don't match themselves. 

Here is a complete list of the metacharacters: . ^ $ * + ? { } [ ] \ | ( )

- `.` matches any single character except newline
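- `^` matches the beginning of the string (and of each line in `re.MULTILINE` mode)
- `$` matches the end of the string or just before a trailing newline (and the end of each line in `re.MULTILINE` mode)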
- `*` matches 0 or more occurrences of preceding expression
- `+` matches 1 or more occurrences of preceding expression
- `?` matches 0 or 1 occurrence of preceding expression
- `{ }` matches specified number of occurrences of preceding expression: `{n}` matches exactly n; `{n,}` matches n or more; `{n,m}` matches at least n and at most m 
- `[...]` matches any single character in brackets; `[^...]` matches any single character not in brackets
- `\` either escapes the metacharacter that follows it or begins a special sequence such as `\d`
- `a|b` matches either a or b
- `( )` groups regular expressions and remembers matched text

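Among them, the backslash `\` is especially useful because it introduces special sequences, for example:
- `\d`, `\s`, `\w` match any decimal digit, any whitespace character, and any word character (letter, digit, or underscore), respectively
- `\D`, `\S`, `\W` match any character *except* a decimal digit, a whitespace character, and a word character, respectively
- `\n`, `\t`, etc. match newlines, tabs, etc.
- `\b` matches a word boundary, so `\b...\b` matches only whole words
- `\1`, `\2`, etc. match the text captured by the group with that number (a backreference)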

The above list of special sequences is not complete. 
For a complete list of sequences and expanded class definitions for Unicode string patterns, see [Regular Expression Syntax](https://docs.python.org/3/library/re.html#re-syntax) in the Standard Library reference.

Let's work through some examples: 

In [None]:
p = re.compile('Batman|Joker') # matches 'Batman' or 'Joker'
m = p.search('Batman and Joker'); print(m)
m = p.search('Joker and Batman'); print(m)

In [None]:
p = re.compile('^From') # matches 'From' only at the beginning
print(p.search('From Here to Eternity'))
print(p.search('Reciting From Memory'))

In [None]:
p = re.compile('}$') # matches '}' only at the end
print(p.search('{block}'))
print(p.search('{block} '))
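
The anchors `^` and `$` behave differently on multi-line text depending on the `re.MULTILINE` flag. The following is a minimal sketch of that difference; the sample text and variable names are made up for illustration.

```python
import re

text = 'From: alice\nFrom: bob'

# Default mode: '^' anchors only at the very start of the string.
default_hits = re.findall(r'^From', text)

# With re.MULTILINE, '^' also anchors right after each newline.
multiline_hits = re.findall(r'^From', text, flags=re.MULTILINE)

print(default_hits)    # ['From']
print(multiline_hits)  # ['From', 'From']
```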

In [None]:
p = re.compile('[a-z]+') # matches one or more lowercase letters
print(p.search(''))
print(p.search('::: message'))

In [None]:
p = re.compile('(Ha){2,3}') # matches 2 to 3 consecutive 'Ha's
print(p.search('Ha'))
print(p.search('HaHa'))
print(p.search('HaHaHa'))

In [None]:
p = re.compile('Bat(wo)?man') # matches 'Batman' or 'Batwoman'
print(p.search('The adventures of Batman'))
print(p.search('The adventures of Batwoman'))

In [None]:
p = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d') # matches phone number in xxx-xxx-xxxx format (raw string keeps the backslashes intact)
print(p.search('My number is 260-481-0146.').group())
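
The `{n}` repetition count from the metacharacter list can shorten the same phone-number pattern. This is a sketch of one equivalent way to write it; the name `phone` is only illustrative.

```python
import re

# Same xxx-xxx-xxxx pattern, written with {n} repetition counts.
phone = re.compile(r'\d{3}-\d{3}-\d{4}')

m = phone.search('My number is 260-481-0146.')
print(m.group())  # 260-481-0146
```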

In actual programs, the most common style is to store the `Match` object in a variable and then check whether it is `None`.

In [None]:
line = 'some string'
p = re.compile('some')
m = p.search(line)
if m:
    print('Match found: ', m.group())
else:
    print('No match')

### Raw String

As stated earlier, regular expressions use the backslash character `\` to indicate special forms or to allow special characters to be used without invoking their special meaning. 
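This collides with Python's use of the same character for escapes in string literals, so a pattern that matches a literal backslash has to be written as `'\\\\'` in a normal string literal.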

The solution is to use Python's **raw string** notation for regular expressions; backslashes are not handled in any special way in a string literal prefixed with `r`.

For example, `r'\n'` is a two-character string containing `'\'` and `'n'`, while `'\n'` is a one-character string containing a newline. 
Regular expressions will often be written in Python code using this raw string notation.

For example,

In [None]:
s = "Hello\tfrom ECE 303\n Hi" # normal string literal
print(s)
s = r"Hello\tfrom ECE 303\n Hi" # raw string
print(s)
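
The difference in length between `r'\n'` and `'\n'` described above can be checked directly. A small sketch, with illustrative variable names:

```python
raw = r'\n'      # two characters: a backslash followed by 'n'
cooked = '\n'    # one character: a newline

print(len(raw), len(cooked))  # 2 1
print(raw == '\\n')           # True: the raw string equals an escaped backslash plus 'n'
```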

In a normal string literal, `'\b'` is the backspace character, so `'\bclass\b'` would contain two backspaces. For `\b` to reach the regex engine as a word boundary, a raw string is needed:

In [None]:
p = re.compile(r'\bclass\b') # matches 'class' only when it is a complete word 
print(p.search('no class today'))
print(p.search('the declassified algorithm'))
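
To see what goes wrong without the `r` prefix, the sketch below runs the same search with and without it, and then matches a literal backslash. The sample strings (including the Windows-style path) are made up for illustration.

```python
import re

text = 'no class today'

# Without the r prefix, '\b' is a backspace character (\x08), not a word boundary.
not_raw = re.search('\bclass\b', text)
raw = re.search(r'\bclass\b', text)
print(not_raw)      # None
print(raw.group())  # class

# Matching one literal backslash: the regex needs \\, written here as r'\\'.
path = r'C:\temp'
backslash = re.search(r'\\', path)
print(backslash.span())  # (2, 3)
```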

### Grouping

Frequently you need to obtain more information than just whether there is a match or not. 
Regular expressions are often used to dissect strings by creating groups using the `(` and `)` metacharacters. 

Suppose you want to separate the area code from the rest of the phone number:

In [None]:
p = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)') # a regex with two groups 
m = p.search('My number is 260-481-0146.')
print(m.group())
print(m.group(0))
print(m.group(1))
print(m.group(2))

Passing 0 or nothing to the `group()` method will return the entire matched text. 

Passing 1 or 2 to the `group()` method returns the text matched by the first or second group.

Or if you would like to retrieve all the groups at once, use the `groups()` method.

In [None]:
areaCode, mainNumber = m.groups() 
print(areaCode)
print(mainNumber)

Backreferences in a pattern allow you to specify that the contents of an earlier capturing group must also be found at the current location in the string.

For example, `\1` will succeed if the exact contents of group 1 can be found at the current position, and fails otherwise.

The following RE detects doubled words in a string:

In [None]:
p = re.compile(r'\b(\w+)\s+\1\b') # matches a word that is immediately repeated, separated by whitespace
print(p.search('Paris in the the spring').group())
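
It can help to see when the backreference does *not* match. The sketch below reuses the doubled-word pattern on a few made-up sentences; the name `doubled` is only illustrative.

```python
import re

doubled = re.compile(r'\b(\w+)\s+\1\b')

no_repeat = doubled.search('Paris in the spring')  # no word is repeated
prefix_only = doubled.search('the theory')         # 'the' is only a prefix of 'theory'
repeat = doubled.search('it is is fine')

print(no_repeat)        # None
print(prefix_only)      # None (the trailing \b rejects the partial match)
print(repeat.group(1))  # is
```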

### Greedy versus Non-Greedy

Python's regular expressions are *greedy* by default, which means that in ambiguous situations they will match the longest string possible. 

In the following example, although `'HaHa'`, `'HaHaHa'`, and `'HaHaHaHa'` are all valid matches of `'(Ha){2,4}'`, it returns `HaHaHaHa` instead of shorter possibilities.

In [None]:
p = re.compile('(Ha){2,4}') # matches 2 to 4 consecutive 'Ha's - greedy version
print(p.search('HaHaHaHa'))

To match the shortest string possible, use the *non-greedy* version of the quantifier by adding a `?` after it. 

The same `?` suffix works with the other quantifiers, giving `*?`, `+?`, and `??`.

In [None]:
p = re.compile('(Ha){2,4}?') # matches 2 to 4 consecutive 'Ha's - non-greedy version
print(p.search('HaHaHaHa'))
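
The same greedy versus non-greedy contrast applies to `+` and `*`. A short sketch on a made-up digit string:

```python
import re

digits = '12345'

greedy = re.search(r'\d+', digits).group()       # as many digits as possible
lazy = re.search(r'\d+?', digits).group()        # as few digits as possible (at least one)
lazy_star = re.search(r'\d*?', digits).group()   # lazy * is satisfied by zero digits

print(greedy)           # 12345
print(lazy)             # 1
print(repr(lazy_star))  # ''
```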

### The `findall()` Method

RE objects also have a `findall()` method. 
While `search()` returns a `Match` object of the *first* matched text in the searched string, the `findall()` method returns the strings of *every* match in the searched string:

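- When called on an RE with no groups, the `findall()` method returns a list of string matches
- When called on an RE with two or more groups, the `findall()` method returns a list of tuples of strings, one tuple per match (with exactly one group, it returns a list of that group's strings)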

In [None]:
p = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d') # matches phone number in xxx-xxx-xxxx format
print(p.search('Cell: 260-123-4567 Work: 260-481-0146').group()) # first match only
print(p.findall('Cell: 260-123-4567 Work: 260-481-0146')) # every match
p = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)') # a regex with three groups
print(p.findall('Cell: 260-123-4567 Work: 260-481-0146'))
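
One case is easy to overlook: when the pattern has exactly one group, `findall()` returns a plain list of that group's strings rather than a list of tuples. A minimal sketch, with an illustrative variable name:

```python
import re

text = 'Cell: 260-123-4567 Work: 260-481-0146'

# One capturing group: each list item is that group's text, not a tuple.
area_codes = re.findall(r'(\d{3})-\d{3}-\d{4}', text)
print(area_codes)  # ['260', '260']
```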

### Wildcard

The `.` character in a regular expression is called a **wildcard** and will match any character except for a newline.

For example, the following finds every three-character substring that ends with `'at'`:

In [None]:
p = re.compile('.at')
p.findall('The cat in the hat sat on the flat mat.')

To match an actual dot, escape it with a backslash: `\.`

In [None]:
p = re.compile(r'\.com')
print(p.search('google.com').group())

The `.*` (dot-star) can be used in a regular expression to match *anything* (any run of characters other than newlines).

For example, to match the string `'First Name:'` followed by any text, followed by `'Last Name:'`, and then followed by any text again:

In [None]:
p = re.compile('First Name: (.*) Last Name: (.*)')
print(p.search('First Name: Tom Last Name: Grady').groups())

The dot-star uses *greedy* mode: It tries to match as much text as possible. 

To match in *non-greedy* mode, use `.*?` instead.

In [None]:
p = re.compile('<.*>')
print(p.search('<html><head><title>Title</title>').group())

p = re.compile('<.*?>')
print(p.search('<html><head><title>Title</title>').group())
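
Because `.` does not match a newline, `.*` stops at the end of the first line. The sketch below shows this, along with the `re.DOTALL` flag, which makes `.` match newlines as well. The sample text is made up for illustration.

```python
import re

text = 'first line\nsecond line'

one_line = re.search(r'.*', text).group()
everything = re.search(r'.*', text, flags=re.DOTALL).group()

print(one_line)            # first line
print(everything == text)  # True
```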

<a id="modifying-strings"></a>
## Modifying Strings

Regular expressions are also commonly used to modify strings in various ways:

- `split()` splits the string into a list, splitting it wherever the RE matches
- `sub()` finds all substrings where the RE matches, and replaces them with a different string
- `subn()` does the same thing as `sub()`, but returns the new string and the number of replacements

### Splitting Strings

The `split()` method of a pattern object splits a string apart wherever the RE matches, returning a list of the pieces. 
It is similar to the `split()` method of strings but provides much more generality in the delimiters that you can split by. 

In the following example, the delimiter is any sequence of non-alphanumeric characters:

In [None]:
p = re.compile(r'\W+')
p.split('This is a test, short and sweet, of split().')

You can also limit the number of splits made by passing a `maxsplit` value as the second argument to the pattern's `split()` method; the rest of the string is returned as the final element of the list.

In [None]:
p = re.compile(r'\W+')
p.split('This is a test, short and sweet, of split().', maxsplit=3)
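
To make the comparison with the string `split()` method concrete, the sketch below splits a record that mixes several delimiters. The sample record and variable names are made up for illustration.

```python
import re

record = 'apples, pears;plums   figs'

# str.split() accepts only one fixed delimiter.
by_comma = record.split(',')

# A regex can treat commas, semicolons and runs of whitespace as delimiters at once.
by_regex = re.split(r'[,;\s]+', record)

print(by_comma)  # ['apples', ' pears;plums   figs']
print(by_regex)  # ['apples', 'pears', 'plums', 'figs']
```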

### Search and Replace

Another common task is to find all the matches for a pattern, and replace them with a different string.

- The `sub()` method takes a replacement value, which can either be a string or a function, and the string to be processed
- The `subn()` method not only does the replacement, but also returns the number of replacements that were performed

In [None]:
p = re.compile(r'Agent \w+')
print(p.sub('CENSORED', 'Agent Alice gave the secret documents to Agent Bob.'))
print(p.subn('CENSORED', 'Agent Alice gave the secret documents to Agent Bob.'))
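
Since `subn()` returns a tuple, it is often unpacked right away. The sketch below also shows the optional `count` argument of `sub()`, which caps the number of replacements. The variable names are illustrative.

```python
import re

p = re.compile(r'Agent \w+')
text = 'Agent Alice gave the secret documents to Agent Bob.'

# subn() returns (new_string, number_of_replacements).
new_text, n = p.subn('CENSORED', text)
print(n)  # 2

# count=1 limits sub() to the first match only.
first_only = p.sub('CENSORED', text, count=1)
print(first_only)  # CENSORED gave the secret documents to Agent Bob.
```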

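You may also use the matched text itself as part of the substitution: in the replacement string, `\1`, `\2`, and so on stand for the text of group 1, 2, and so on.

In [None]:
p = re.compile(r'Agent (\w)\w*') # group 1 captures the first letter of the name
print(p.sub(r'\1****', 'Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.'))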

Replacement can also be a function, which gives you even more control. 
If *replacement* is a function, the function is called for every non-overlapping occurrence of *pattern*. 

In the following example, the replacement function translates decimal numbers into hexadecimal:

In [None]:
def hexrepl(match):
    '''Return the hex string for a decimal number'''
    value = int(match.group())
    return hex(value)

p = re.compile(r'\d+')
p.sub(hexrepl, 'Call 65490 for printing, 49152 for user code.')