# Regular Expressions



## References

* Introducing Regular Expressions, by Michael Fitzgerald, 2012 
* https://www.tutorialspoint.com/python/python_reg_expressions.htm
* https://www.machinelearningplus.com/python/python-regex-tutorial-examples/
* https://regexone.com
* https://www.rexegg.com

## What is a Regular Expression?

**Regular expressions** are specially encoded strings that are used as patterns to match sets of strings.  For regular expressions, everything is essentially a character.  We will write patterns to match specific sequences of characters (strings).  Mostly, we will focus on normal ASCII characters (letters, digits, punctuation, and other symbols), but we can also use Unicode characters. 

Regular expressions are central to many standard Unix commands, such as `ed`, `sed`, `awk`, and `grep`.  

Regular expressions are implemented in almost every modern computer language: JavaScript, Java, C/C++, C#, Perl, Ruby, R, etc.  In Python, we will use the module `re`. 

You can test many of the examples presented directly in the Jupyter notebook, or use other resources such as the Regexpal website https://www.regexpal.com. 

I am going to skip over the formal theory behind regular expressions (which comes from formal language theory), and instead walk you through a set of examples to give you practice learning and using regular expressions.  

## Regular Expression Syntax


| Regex syntax  |  Meaning   |
|:--------------|:-----------|
| [abc]         | a single character of: a, b, or c  | 
| [^abc]        | none of these characters | 
| [a-z]         | any character in this range |
| [^a-z]        | none of the characters in this range | 
| .             | any single character | 
| \\.           | a period  | 
| [123]         | matches only 1 or 2 or 3 | 
| [0-9]         | matches numbers 0 to 9 | 
| \d            | any digit | 
| \D            | any non-digit character | 
| \w            | any word character (letter, digit, or underscore) | 
| \W            | any non-word character | 
| \s            | white space | 
| \S            | non-white space | 
| *             | zero or more repetitions | 
| +             | one or more repetitions |
| {m}           | m repetitions | 
| {m,n}         | between m and n repetitions | 
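
As a quick illustration of a few rows of the table, the sketch below applies four patterns to a made-up sample string. It uses `re.findall`, a function from Python's `re` module that returns every non-overlapping match as a list; the variable names and the sample text are assumptions chosen only for this illustration.

```python
import re

sample = "Call 555-1234 at 9 a.m."  # hypothetical sample text

vowels = re.findall(r'[aeiou]', sample)      # character class: a single vowel
digit_runs = re.findall(r'\d+', sample)      # one or more digits in a row
four_digits = re.findall(r'\d{4}', sample)   # exactly four digits
periods = re.findall(r'\.', sample)          # an escaped dot is a literal period

print(vowels)
print(digit_runs)
print(four_digits)
print(periods)
```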

# Examples 

First, import the `re` module.

In [None]:
import re

## Example 1 - Match North American Phone Number

Let's work to match a North American phone number with a regular expression, e.g., 555-123-4567

In [None]:
pnum = '555-123-4567'

We will start with using the function `re.match()`

The syntax of the function is `re.match(pattern, string, flags=0)`

In [None]:
match_obj = re.match('555-123-4567', pnum)

if match_obj: 
  print("The pattern matched")
else: 
  print("The pattern did not match")

In [None]:
match_obj.group()

This pattern is a string literal: we match the string directly in the target text. 

Usually, we want to match many possible inputs.  What if the phone number was 906-555-1234 or 906-555-9876 or ...

### Match Digits with a Character Class

Let's change the regular expression to use a character class, which matches any one of the characters listed inside the square brackets. 



In [None]:
m = re.match('[0123456789]', pnum)
m.group()

In [None]:
m = re.match('[13579]', pnum)
m.group()

In [None]:
m = re.match('[13579][13579]', pnum)
m.group()

To match any 10-digit, North American phone number with hyphens, you could do the following: 

In [None]:
m = re.match('[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]', pnum)
m.group()

### Using a Character Shorthand

We can use the shorthand `\d` to match digits just like `[0-9]`. 

In [None]:
m = re.match(r'\d\d\d-\d\d\d-\d\d\d\d', pnum)
m.group()

Note that we introduced raw string notation (`r"text"`) above.  This notation eases the use of regular expressions.  Otherwise, each backslash `\` would need to be doubled to be recognized. 
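
To make the effect of raw strings concrete, here is a small sketch comparing normal and raw string literals. The variable names are assumptions for illustration only.

```python
import re

# In a normal string literal, '\b' is the backspace character (one character).
# In a raw string literal, r'\b' is a backslash followed by 'b' (two characters).
plain = '\b'
raw = r'\b'
print(len(plain), len(raw))

# Doubling each backslash in a normal string gives the same pattern as the raw string.
doubled = '\\d\\d\\d'
raw_digits = r'\d\d\d'
print(doubled == raw_digits)

# Either spelling matches the first three digits of the phone number.
area = re.match(raw_digits, '555-123-4567').group()
print(area)
```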

We could also change the expression to match not just a hyphen but any character that is not a digit using the `\D` syntax.

In [None]:
pnum2 = '906 555 1234'

In [None]:
m = re.match(r'\d\d\d\D\d\d\d\D\d\d\d\d', pnum)
m.group()

In [None]:
m2 = re.match(r'\d\d\d\D\d\d\d\D\d\d\d\d', pnum2)
m2.group()

### Capturing Groups and Back References

Let's now match just a portion of the phone number using a *capturing group*.  To create a capturing group, enclose a `\d` in parentheses, then use `\1` later in the pattern to backreference what was captured: 

In [None]:
m = re.match(r'(\d)\d\1', '707')
m.group()

In the above expression, the `\1` refers back to what was captured in the group enclosed by parentheses.  As a result, this regular expression matches the prefix `707`.  Here is how: 

* `(\d)` matches the first digit and captures it (the number 7)
* `\d` matches the next digit (the number 0) 
* `\1` references the captured digit (the number 7)  
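
The match object also exposes each captured group by number, which makes the steps above visible. A minimal sketch (the names `whole` and `first` are assumptions for illustration):

```python
import re

m = re.match(r'(\d)\d\1', '707')

whole = m.group(0)   # the entire match, the same as m.group()
first = m.group(1)   # only the text captured by the first pair of parentheses
print(whole, first)
```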

In [None]:
pnum2

In [None]:
m2 = re.match(r'(\d)\d\1', pnum2)
m2.group()

Here we get an error because the pattern does not match: `re.match` returns `None`, and calling `.group()` on `None` raises an `AttributeError`. 
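
One way to avoid that kind of error is to test the result of `re.match` before calling `.group()`. The helper `first_match` below is a hypothetical name introduced only for this sketch; it is not part of the `re` module.

```python
import re

def first_match(pattern, text):
    """Hypothetical helper: return the matched text, or None if there is no match."""
    m = re.match(pattern, text)
    return m.group() if m else None

no_hit = first_match(r'(\d)\d\1', '906 555 1234')   # third digit differs from the first
hit = first_match(r'(\d)\d\1', '909-555-1234')      # third digit repeats the first
print(no_hit, hit)
```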

In [None]:
m2 = re.match(r'(\d)\d\1', '909-555-1234')
m2.group()

### Using Quantifiers

Here is another way to match a phone number with a different syntax.

In [None]:
m = re.match(r'\d{3}-?\d{3}-?\d{4}', pnum)
m.group()

The numbers in the curly braces `{` `}` tell the regex processor *exactly* how many occurrences of those digits to look for.  The braces with numbers are a kind of *quantifier*.  

The question mark (`?`) is another kind of quantifier.  It follows the hyphen in the regular expression and means the hyphen is optional.  That is, there can be zero or one occurrence of the hyphen.  Other quantifiers exist: the plus sign `+` means "one or more" and the asterisk `*` means "zero or more". 

Test the above regex pattern on other phone number styles.

In [None]:
m3 = re.match(r'\d{3}-?\d{3}-?\d{4}', '9065551234')
m3.group()
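
To try several styles at once, a loop along the following lines may help. The list of sample numbers is an assumption made for illustration; only the pattern comes from the cells above.

```python
import re

pattern = r'\d{3}-?\d{3}-?\d{4}'
styles = ['555-123-4567', '9065551234', '906-5551234', '906 555 1234']

# Record whether each style matches the pattern.
results = {s: bool(re.match(pattern, s)) for s in styles}
for s, ok in results.items():
    print(s, '->', 'match' if ok else 'no match')
```

The last style fails because `-?` allows only an optional hyphen, not a space.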

Let's look at an even more concise (but more complicated) regex pattern.

In [None]:
m = re.match(r'(\d{3,4}[.-]?)+', pnum)
m.group()

We can break this down: 

* `(` opens a capturing group 
* `\` start character shorthand 
* `d` end character shorthand (match any digit) 
* `{` open quantifier 
* `3` minimum quantity to match 
* `,` separate quantities 
* `4` maximum quantity to match 
* `}` close quantifier 
* `[` open character class 
* `.` dot or period 
* `-` literal character to match - hyphen 
* `]` close character class 
* `?` zero or one quantifier 
* `)` close capturing group 
* `+` one or more quantifier 

This works, but it would also match other groupings of 3 or 4 digits that are not valid phone numbers.

In [None]:
m = re.match(r'(\d{3,4}[.-]?)+', pnum)
m.group()

In [None]:
pnum3 = "1234-1234-1234"

In [None]:
pnum3 = "1234-1234-1234"
m = re.match(r'(\d{3,4}[.-]?)+', pnum3)
m.group()

### Matches and Non-match cases

For some problems, we want to match certain strings but also not match others. 

Below are six strings.  We want to match the first three, but not the last three.

* fan
* man 
* can 
* dan 
* ran
* pan 

In [None]:
ex_strs = ("fan", "man", "can", "dan", "ran", "pan")
for elem in ex_strs:
    m = re.match("[fmc]an", elem)
    if m:
        print(m.group())

Or, alternatively ...

In [None]:
for elem in ex_strs:
    m = re.match("[^drp]an", elem)
    if m: 
        print (m.group())

Let's now use this in the phone number example. 

In [None]:
m = re.match(r'(\d{3}[.-]?){2}\d{4}', pnum)
if m:
    print(m.group())
else:
    print("No match")

In [None]:
m = re.match(r'(\d{3}[.-]?){2}\d{4}', pnum3)
if m:
    print(m.group())
else:
    print("No match")

This pattern matches two parenthesized sequences of three digits, each followed by an optional period or hyphen, and then exactly four digits.  The string `pnum3` does not match because its first group has four digits rather than three.
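
Patterns like this one can be easier to read when written with the `re.VERBOSE` flag, which ignores whitespace in the pattern and allows `#` comments. The sketch below is one possible layout; the name `phone_re` and the extra test strings are assumptions for illustration.

```python
import re

phone_re = re.compile(r'''
    (\d{3}[.-]?){2}   # two groups of three digits, each with an optional . or -
    \d{4}             # then exactly four digits
''', re.VERBOSE)

hyphens = phone_re.match('555-123-4567')
dots = phone_re.match('555.123.4567')
too_long = phone_re.match('1234-1234-1234')

print(hyphens.group(), dots.group(), too_long)
```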

## Example 2 - Pattern Matching 

Let us load in a text to work with.  We will use "The Rime of the Ancient Mariner" by Samuel Taylor Coleridge, first published in Lyrical Ballads (London, J. & A. Arch, 1798). 

The plain text version is available in `rime-intro.txt`.

In [None]:
# Read in the text; the with block closes the file automatically
with open('data/rime-intro.txt', 'r') as f:
    rime = f.readlines()

In [None]:
rime

### Matching String Literals

First, let's match string literals.  For instance, to find the word Ship, search for "Ship".

In [None]:
for line in rime:
    #print (line)
    m = re.search(r"Ship", line)
    if m:
        print(m.group())
    else:
        print("No match")

### Matching Digits 

The character shorthand `\d` matches any single digit, as does the character class `[0-9]`.

In [None]:
for line in rime:
    #print (line)
    m = re.search(r"\d", line)
    if m:
        print(m.group())
    else:
        print("No match")

### <a name="ques1"></a> Matching Word and Non-Word Characters

Now consider the shorthand `\w`.  Examine the difference between `\D` and `\w`. 

In [None]:
for line in rime:
    #print (line)
    m = re.search(r"\D", line)
    if m:
        print(m.group())
    else:
        print("No match")

In [None]:
for line in rime:
    #print (line)
    m = re.search(r"\w", line)
    if m:
        print(m.group())
    else:
        print("No match")

What is different?  [Answer below](#ans1) 

### Matching Whitespace 

Whitespace is matched using the shorthand character `\s`

In [None]:
for line in rime:
    #print (line)
    m = re.search(r"\s", line)
    if m:
        print(m.group())
    else:
        print("No match")

What does this mean?  Every single line of text has whitespace, if only the newline character that `readlines()` keeps at the end of each line. 

What if we change the function used from `re.search` to `re.match`?

In [None]:
for line in rime:
    #print (line)
    m = re.match(r"\s", line)
    if m:
        print(m.group())
    else:
        print("No match")

This illustrates the difference between 

* [`re.search`](https://docs.python.org/3.7/library/re.html#re.search) which **scans** through the *string* for the first location of the *pattern* to match and returns the corresponding match object, and 
* [`re.match`](https://docs.python.org/3.7/library/re.html#re.match) which returns the match object if zero or more characters at the **beginning** of *string* match the *pattern*. 
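
The difference can be seen on a single string. The sketch below uses a made-up sample line and also shows `re.fullmatch`, which succeeds only when the whole string matches the pattern; the variable names are assumptions for illustration.

```python
import re

text = "The Ship was cheered"  # hypothetical sample line

at_start = re.match(r'Ship', text)       # None: 'Ship' is not at the beginning
anywhere = re.search(r'Ship', text)      # found by scanning through the string
prefix = re.match(r'The', text)          # matches at the beginning
entire = re.fullmatch(r'The', text)      # None: pattern does not cover the whole string

print(at_start, anywhere.span(), prefix.group(), entire)
```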

### Matching Any Character 

The `.` can be used to match any character. 

To match any 8 characters you could then use `........`. 

A better way would be to specify the number with a quantifier `.{8}`

In [None]:
for line in rime:
    # Fix the pattern below?
    m = re.search(r".{8}", line)
    if m:
        print(m.group())
    else:
        print("No match")
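
One detail worth knowing when using `.` on lines read from a file: by default the dot does not match a newline character. The sketch below shows this on a short made-up string, along with the `re.DOTALL` flag that changes the behavior; the variable names are assumptions.

```python
import re

line = "abc\n"  # hypothetical line ending in a newline, as readlines() returns

default_dot = re.findall(r'.', line)             # newline is skipped
dotall_dot = re.findall(r'.', line, re.DOTALL)   # newline is included

print(default_dot)
print(dotall_dot)
```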

### <a name="ques2"></a>Word Boundaries 

The shorthand `\b` matches a word boundary, without consuming characters.  

What does `\bA.{5}t\b` match?  [Answer below](#ans2) 

In [None]:
for line in rime:
    # Fix the pattern below?
    m = re.search(r"\bA.{5}t\b", line)
    if m:
        print(m.group())
    else:
        print("No match")
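
To see what `\b` adds, compare a pattern with and without word boundaries on a small made-up string (the sample text and variable names are assumptions for illustration):

```python
import re

text = "cat concatenate cat."  # hypothetical sample text

loose = re.findall(r'cat', text)        # also finds 'cat' inside 'concatenate'
strict = re.findall(r'\bcat\b', text)   # only 'cat' standing as its own word

print(len(loose), len(strict))
```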

### Match Arbitrary Number of Characters 

The Kleene Star `*` and Kleene Plus `+` represent **0 or more** and **1 or more** repetitions, respectively, of the character that they follow. 

In the example, let's try to match the first three strings but skip the fourth. 

In [None]:
ex_strs = ("aaaabcc", "aabbbc", "aaccc", "a")
for elem in ex_strs:
    #print (elem)
    # b* allows zero b's, so "aaccc" matches as well
    m = re.match("a+b*c+", elem)
    if m: 
        print (elem + "\t" + m.group())
    else:
        print (elem + "\t" + "No match")
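
The difference between `*` and `+` shows up only when the repeated character is absent. A minimal sketch with made-up strings, using `re.fullmatch` so that the whole string must match:

```python
import re

words = ("ac", "abc", "abbbc")  # hypothetical test strings

star = [bool(re.fullmatch(r'ab*c', w)) for w in words]   # zero or more b
plus = [bool(re.fullmatch(r'ab+c', w)) for w in words]   # one or more b

print(star)
print(plus)
```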

### Optional Characters 

Another common quantifier is the `?` character, which denotes optionality. 

This allows you to match either zero or one of the preceding character or group. 

In [None]:
ex_strs = ("abc", "abcd", "ac", "ad", "bc")
for elem in ex_strs:
    #print (elem)
    m = re.match("ab?c", elem)
    if m: 
        print (elem + "\t" + m.group())
    else:
        print (elem + "\t" + "No match")

## Additional Examples

Complete the lessons at https://regexone.com.

<hr>

## Example Question Answers

<a name="ans1"></a>Back to Question [What is different between `\D` and `\w`?](#ques1)  

* `\D` matches any character that is not a digit: letters, whitespace, punctuation, quotation marks, hyphens, forward slashes, brackets, etc. 
* `\w` matches letters, digits, and the underscore

In other words, for ASCII text `\w` matches `[a-zA-Z0-9_]`
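
A quick check of both shorthands on a short made-up string is sketched below. It also uses the `re.ASCII` flag, which restricts `\w` to ASCII characters; without it, Python 3 string patterns treat accented letters as word characters too. The sample string and variable names are assumptions for illustration.

```python
import re

sample = 'a_1 -\u00e9'  # hypothetical sample: a, underscore, 1, space, hyphen, e-acute

word_chars = re.findall(r'\w', sample)             # Unicode-aware by default
ascii_words = re.findall(r'\w', sample, re.ASCII)  # ASCII letters, digits, underscore
non_digits = re.findall(r'\D', sample)             # everything except the digit

print(word_chars)
print(ascii_words)
print(non_digits)
```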

<a name="ans2"></a>Back to Question [What does `\bA.{5}t\b` match?](#ques2)

The pattern matches the word `Ancyent`:

* The shorthand `\b` matches a word boundary, without consuming any characters.
* The characters `A` and `t` bound the sequence of characters.
* `.{5}` matches any five characters between them.
* The final `\b` matches another word boundary.