# Python Regular Expression Tutorial
> Chung-Yu Shao, cshao@andrew.cmu.edu

## Introduction

This tutorial summarizes my experience learning the famous **regular expression**. Although homework 1 already included tasks involving regular expressions, they are a very useful but easily forgotten tool that we, as software engineers or researchers in computer-related fields, should keep sharp. Regular expressions often look scary at first glance; I hope we can learn them better together through this tutorial. In this tutorial, I will mainly cover:

* (1) What a regular expression is, and what functionality the Python `re` module supports.
* (2) The `Match` object: what we get from calling the `re` find-like operations.
* (3) The `Pattern` object: mainly, how to write a pattern.
* (4) Advanced discussion
    * 4.1 `\b` and `\B`
    * 4.2 Raw string notation
    * 4.3 The "greedy" match concept
    * 4.4 The extension notation `(?...)`

## [What is regular expression?](https://en.wikipedia.org/wiki/Regular_expression)
> a regular expression (sometimes called a rational expression) is a sequence of characters that define a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. **"find and replace"-like operations**.

So, what kinds of "find and replace"-like operations does the Python `re` module support?
* `re.search(pattern, string, flags=0)`
    * Checks for a match **anywhere in the string**
* `re.match(pattern, string, flags=0)`
    * Checks for a match only **at the beginning of the string**
* `re.split(pattern, string, maxsplit=0, flags=0)`
    * Split string by the occurrences of pattern.   
* `re.sub(pattern, repl, string, count=0, flags=0)`
    * Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl.
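* `re.findall(pattern, string, flags=0)`
    * Return all non-overlapping matches of pattern in string, as a list of strings (or of groups, if the pattern contains groups).
* `re.finditer(pattern, string, flags=0)`
    * Return an iterator yielding a `Match` object for each non-overlapping match.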
    
Let's see a quick example!

In [2]:
import re
string = "abcdba"
abMatch = re.match(r"ab", string)
noMatch = re.match(r"cd", string)
cdMatch = re.search(r"cd", string)
print "ab match:{}\nno match: {}\ncd match:{}".format(abMatch, noMatch, cdMatch)
print "----"
pattern = re.compile("ab")
abCompileMatch = pattern.match(string)
print "pattern:{}\nab compile match:{}".format(pattern, abCompileMatch)
print "----"
print "type of r\'ab\':{}\ntype of `pattern`: {}".format(type(r"ab"), type(pattern))

ab match:<_sre.SRE_Match object at 0x1101b2718>
no match: None
cd match:<_sre.SRE_Match object at 0x1101b2578>
----
pattern:<_sre.SRE_Pattern object at 0x1101514f0>
ab compile match:<_sre.SRE_Match object at 0x1101b25e0>
----
type of r'ab':<type 'str'>
type of `pattern`: <type '_sre.SRE_Pattern'>


So, what can we observe from the above snippet? (1) The difference between `re.match()` and `re.search()` is that `re.match()` only looks for a match at the beginning of the string, whereas `re.search()` finds the pattern anywhere in the target string. (2) There are mainly two types of objects we will use: `_sre.SRE_Match` and `_sre.SRE_Pattern`, which we will discuss in the later sections. (3) There are mainly two styles of using a pattern, shown below: passing the pattern string directly to a module-level function, or compiling it first with `re.compile()` and calling the method on the resulting object. The `r"..."` raw string notation used in the first style is covered later in the advanced discussion.
```python
abMatch = re.match(r"ab", string)
# or
pattern = re.compile("ab")
abCompileMatch = pattern.match(string)    
``` 
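The quick example above only exercises `re.match()` and `re.search()`. As a minimal sketch of the remaining operations from the list, the snippet below runs `re.split()`, `re.sub()`, `re.findall()`, and `re.finditer()` on one made-up input string (the string and variable names are illustrative only). It uses `print(...)` with a single argument so that it also runs on Python 3.

```python
import re

text = "one, two,three ,  four"

# re.split: cut the string wherever the pattern (a comma with optional spaces) matches
parts = re.split(r"\s*,\s*", text)
print(parts)

# re.sub: replace every match; count=0 (the default) means "replace all"
dashed = re.sub(r"\s*,\s*", " - ", text)
print(dashed)

# re.findall: all non-overlapping matches, returned as plain strings
words = re.findall(r"[a-z]+", text)
print(words)

# re.finditer: the same matches, but as Match objects, so positions are available
spans = [(m.group(), m.span()) for m in re.finditer(r"[a-z]+", text)]
print(spans)
```

`re.findall()` is convenient when only the matched text matters; `re.finditer()` is the one to reach for when the positions are needed as well.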
    
---

There is a lot to discuss about the `Pattern` object, especially the rules for building a pattern, so let's start with the `_sre.SRE_Match` object.

## `Match` object
A Match object is **what we get back from the find-like operations**. So, what would the user like to have as the result of a "find and replace"-like operation? Probably **the string(s) that match the pattern(s)** and **the index(es) of each matched substring**. That is what `Match` provides! Let's take a deeper look at these two parts. Note that there is an important concept called **group(s)** in regular expressions. Wrapping part of a pattern in `()` "groups" it, so you can retrieve the substring matched by that part, as well as its indexes. Formally speaking, we can concatenate sub-patterns wrapped in parentheses into a larger pattern to match a string. The match of each sub-pattern is returned as a separate *group*, accessed with the functions below.

* `group([group1, ...])` 
    * Returns one or more subgroups of the match. 
* `groups([default])`
    * Return a tuple containing all the subgroups of the match, from 1 up to however many groups are in the pattern. 
    

In [3]:
matchObj = re.match(r"(ab).*(de)", "abcde") 
# `.*` represents any number of arbitrary characters (except newline)
print matchObj.group()
print "---"
print matchObj.group(0) # The entire match
print matchObj.group(1) # The first parenthesized subgroup.
print matchObj.group(2) # The second parenthesized subgroup.
print matchObj.group(0, 1, 2) # Multiple arguments give us a tuple.
print "---"
print matchObj.groups()

abcde
---
abcde
ab
de
('abcde', 'ab', 'de')
---
('ab', 'de')


Since we already have the group(s) of the match, the indexes we want from the `Match` object are retrieved per group.

* `start([group])` and `end([group])`
    * Return the indices of the start and end of the substring matched by group.
* `span([group])`
    * For MatchObject m, return the 2-tuple (m.start(group), m.end(group)). 
    * If group did not contribute to the match, this is (-1, -1).

In [4]:
matchObj = re.match(r"(ab).*(de)", "abcde")
print matchObj.group(), matchObj.groups()
print "---"
print matchObj.group(0), matchObj.span(0) # The entire match
print matchObj.group(1), matchObj.span(1) # The first parenthesized subgroup.
print matchObj.group(2), matchObj.span(2) # The second parenthesized subgroup.
print "---"

abcde ('ab', 'de')
---
abcde (0, 5)
ab (0, 2)
de (3, 5)
---
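The `(-1, -1)` case mentioned above does not show up in the example, because both groups take part in the match. The following sketch uses a small made-up alternation where only one of the two groups can participate, and also shows the optional `default` argument of `groups()`.

```python
import re

# Only one branch of the alternation can match, so the other group stays unset.
m = re.match(r"(a)|(b)", "b")

print(m.group(1))       # group 1 did not take part in the match
print(m.group(2))       # group 2 matched "b"
print(m.span(1))        # span of a group that did not contribute
print(m.span(2))
print(m.groups())       # unset groups are reported as None by default
print(m.groups("n/a"))  # ... or as the given default value
```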


## How to write pattern?
> regular expression (sometimes called a rational expression) is **a sequence of characters** that define a search pattern

Remember what we learned from the Wikipedia page? A regular expression is composed of a sequence of characters. To understand how to write a pattern, we need to know the **characters** first. Generally speaking, regular expressions, regardless of the language, separate characters into ordinary characters and special characters.

Most characters are **"ordinary characters"**, which **simply match themselves**. However, there are **special characters** which either 
* (1) stand for classes of ordinary characters, or 
* (2) affect how the regular expressions around them are interpreted.

Among the second kind, I mainly categorize the functionalities as 
* (1) setting anchor
* (2) repetitions
* (3) building logic

#### Standing for classes of ordinary characters
* `'.'`: In the default mode, dot `.` matches any character except a newline.
* `'[]'`: Used to indicate a set of characters
    * Special characters lose their special meaning inside sets. For example, `'[(+*)]'` will match any of the literal characters `'(', '+', '*'`, or `')'`.
* `'[^]'`: Complementing the set. If the first character of the set `'[]'` is `'^'`, all the characters that are not in the set will be matched.
* Character sets

| Sequence | Represents      |
|       ---|              ---|
| `\w`  | `[a-zA-Z0-9_]`  |
| `\W`  | `[^a-zA-Z0-9_]` |
| `\s`  | `[ \n\r\t\f\v]` |
| `\S`  | `[^ \n\r\t\f\v]`|
| `\d`  | `[0-9]`         |
| `\D`  | `[^0-9]`        |

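To make the character classes above concrete, here is a small sketch that runs a few of them through `re.findall()` on a made-up line of text (the sample string and variable names are illustrative only).

```python
import re

line = "Order #42: 3 apples (+1 free) at $1.50"

digits = re.findall(r"\d+", line)        # runs of digits
words = re.findall(r"\w+", line)         # runs of word characters
literals = re.findall(r"[(+*)]", line)   # '(', '+', '*', ')' are literal inside a set
others = re.findall(r"[^\w\s]", line)    # complemented set: neither word char nor whitespace
dotted = re.findall(r"a.p", line)        # '.' stands for any single character

print(digits)
print(words)
print(literals)
print(others)
print(dotted)
```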

#### Setting anchor
* `'^'`: Matches the start of the string.
* `'$'`: Matches the end of the string or just before the newline at the end of the string.
* `'(...)'`: Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group
* `'\A'`: Matches only at the start of the string.
* `'\Z'`: Matches only at the end of the string.
* `'\b'`: Matches the empty string, but only at the beginning or end of a word. See "Advanced discussion 1: `\b` and `\B`".
* `'\B'`: Matches the empty string, but only when it is not at the beginning or end of a word. See "Advanced discussion 1: `\b` and `\B`".

#### Repetitions
* `'*'`: Match 0 or more repetitions of the preceding RE, as many repetitions as are possible.
* `'+'`: Match 1 or more repetitions of the preceding RE
* `'?'`: Match 0 or 1 repetitions of the preceding RE
* `'{m}'`: Specifies that exactly m copies of the previous RE should be matched
* `'{m,n}'`: Causes the resulting RE to match from m to n (inclusive) repetitions of the preceding RE

#### Building Logic
* `'|'`: `A|B`, where A and B can be arbitrary REs, creates a regular expression that will match either A or B.

---
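Before moving on, here is a short sketch of the anchor, repetition, and alternation characters listed above. The test strings are made up for illustration; the comments describe what each pattern is meant to demonstrate.

```python
import re

# '^' and '$' pin the pattern to both ends of the string
all_digits = bool(re.search(r"^\d+$", "2024"))
not_all_digits = bool(re.search(r"^\d+$", "2024a"))

# '$' also matches just before a trailing newline, while '\Z' does not
dollar_newline = bool(re.search(r"\d$", "7\n"))
z_newline = bool(re.search(r"\d\Z", "7\n"))

# repetitions of the preceding RE (here: the single character 'b' or '\d')
optional_b = re.findall(r"ab?", "a ab abb")        # 0 or 1 'b'
one_or_more_b = re.findall(r"ab+", "a ab abb")     # 1 or more 'b'
exactly_two_b = re.findall(r"ab{2}", "a ab abb")   # exactly 2 'b'
two_to_three = re.findall(r"\d{2,3}", "1 22 333 4444")

# '|' matches either side
pets = re.findall(r"cat|dog", "cat, dog, cow")

print(all_digits, not_all_digits, dollar_newline, z_newline)
print(optional_b, one_or_more_b, exactly_two_b)
print(two_to_three, pets)
```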
With the above definition in mind, let's practice an easy but well known example.

### U.S. Phone number matching
#### (1) Suppose the phone number only contains digits, e.g. "6503352800"

In [5]:
print re.match(r"\d{10}", "6503352800").group() # use `\d` and {m} to specify m copies of a digit (a U.S. number has 10)

6503352800


#### (2) What if there might be a `'-'` between the third and fourth digits, or between the sixth and seventh digits? E.g. "650-335-2800", "650-3352800", "6503352800", "650335-2800"

In [6]:
print re.match(r"\d{3}-?\d{3}-?\d{4}", "6503352800").group()
print re.match(r"\d{3}-?\d{3}-?\d{4}", "650-335-2800").group()
print re.match(r"\d{3}-?\d{3}-?\d{4}", "650-3352800").group()
print re.match(r"\d{3}-?\d{3}-?\d{4}", "650335-2800").group()

6503352800
650-335-2800
650-3352800
650335-2800


#### (3) Wait, we do have "650-335-2800", "650-3352800", "6503352800", but I have never seen "650335-2800"?

In [7]:
print re.match(r"\d{3}-\d{3}-?\d{4}|\d{10}", "650335-2800")
print re.match(r"\d{3}-\d{3}-?\d{4}|\d{10}", "6503352800").group()
print re.match(r"\d{3}-\d{3}-?\d{4}|\d{10}", "650-3352800").group()
print re.match(r"\d{3}-\d{3}-?\d{4}|\d{10}", "650-335-2800").group()

None
6503352800
650-3352800
650-335-2800


#### (4) What if, instead of the first hyphen, there are parentheses around the first three digits? E.g. "(650)3352800", "(650)335-2800", "650-335-2800", "650-3352800", and "6503352800" should all work, but no other form!

In [8]:
print re.match(r"\(\d{3}\)\d{3}-?\d{4}|\d{3}-\d{3}-?\d{4}|\d{10}", "650335-2800")
print re.match(r"\(\d{3}\)\d{3}-?\d{4}|\d{3}-\d{3}-?\d{4}|\d{10}", "(650)3352800").group()
print re.match(r"\(\d{3}\)\d{3}-?\d{4}|\d{3}-\d{3}-?\d{4}|\d{10}", "(650)335-2800").group()
print re.match(r"\(\d{3}\)\d{3}-?\d{4}|\d{3}-\d{3}-?\d{4}|\d{10}", "650-335-2800").group()
print re.match(r"\(\d{3}\)\d{3}-?\d{4}|\d{3}-\d{3}-?\d{4}|\d{10}", "650-3352800").group()
print re.match(r"\(\d{3}\)\d{3}-?\d{4}|\d{3}-\d{3}-?\d{4}|\d{10}", "6503352800").group()

None
(650)3352800
(650)335-2800
650-335-2800
650-3352800
6503352800



Cool! We've used simple concepts like `\d` (the character set that represents the digits `[0-9]`), the repetition expressions `{m}` and `?`, the backslash escape that makes special characters literal (`\(` and `\)`), and the OR operator `|` to build the matching logic. However, there are still lots of topics to explore. The following are the parts that I found useful but confusing at first glance.
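One caveat with the patterns above: `re.match()` only anchors at the beginning, so a string with extra trailing characters still matches on its prefix. The sketch below is one possible way to make the check strict, assuming we want the whole string to be a phone number: the three alternatives are wrapped in a non-capturing group `(?:...)` (covered in Advanced discussion 4) so that a single `\Z` applies to all of them. The names `PHONE` and `is_us_phone` are hypothetical helpers, not part of `re`.

```python
import re

# Hypothetical helper pattern: the three accepted formats, followed by end-of-string.
# Without the (?:...) wrapper, \Z would only bind to the last alternative.
PHONE = re.compile(r"(?:\(\d{3}\)\d{3}-?\d{4}|\d{3}-\d{3}-?\d{4}|\d{10})\Z")


def is_us_phone(text):
    """Return True if the whole string is one of the accepted phone formats."""
    return PHONE.match(text) is not None


accepted = ["(650)3352800", "(650)335-2800", "650-335-2800", "650-3352800", "6503352800"]
rejected = ["650335-2800", "65033528001", "650-335-2800 ext 5"]

for number in accepted + rejected:
    print("{}: {}".format(number, is_us_phone(number)))

# For contrast: the unanchored pattern happily matches a prefix of a longer string.
prefix_only = re.match(r"\d{10}", "65033528001").group()
print(prefix_only)
```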

---

### Advanced discussion 1: `\b` and `\B`
#### `\b`

The usage of `\b` and `\B` is important, but it is not that intuitive when first reading the documentation. 
You need to understand that these two **match with zero length**: they match a position, not a character.
> `\b` matches the empty string, but only at the beginning or end of a word. Formally, `\b` is defined as the boundary between a `\w` (`[a-zA-Z0-9_]`) and a `\W` character (or vice versa), or between `\w` and the beginning/end of the string. 

#### `\B`

> `\B` matches the empty string, but only when it is not at the beginning or end of a word, which means `\B` matches at every position where `\b` does not. 

Effectively, `\B` matches at **any position between two word characters** as well as at any position between two non-word characters. Reading the example and comments below will help you out!

In [9]:
# \b example
print re.search(r"\bfoo\b", "foo").group() # boundary between \w and the beginning/end of the string
print re.search(r"\bfoo\b", "foo.").group() # `.` is in the \W set!
print re.search(r"\bfoo\b", "(foo)").group() # `(` and `)` are in the \W set!
print re.search(r"\bfoo\b", "bar foo baz").group() # ` ` is in the \W set!
print re.search(r"\bfoo\b", "foobar") # the right \b can't be matched, since `b` is in the \w set
print re.search(r"\bfoo\b", "foo3") # the right \b can't be matched, since `3` is in the \w set

print "---"

# \B example

print re.search(r"py\B", "python").group() # the empty string was matched between p"yt"hon
print re.search(r"py\B", "py3").group() # the empty string was matched between p"y3"
print re.search(r"py\B", "py") # y's next is the end, so we can't match
print re.search(r"py\B", "py.") # y's next is in \W, so we can't match
print re.search(r"py\B", "py!") # y's next is in \W, so we can't match

foo
foo
foo
foo
None
None
---
py
py
None
None
None
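A typical practical use of `\b` is whole-word search and replace. The sketch below (with a made-up sentence) replaces only the standalone word, lists the zero-width positions where `\b` matches, and uses `\B` to find an occurrence that does not start a word.

```python
import re

text = "cat concat cat. Cats"

# Replace "cat" only where it is a whole word; "concat" and "Cats" are left alone.
replaced = re.sub(r"\bcat\b", "dog", text)
print(replaced)

# \b consumes no characters: each match is just a position in the string.
boundaries = [m.start() for m in re.finditer(r"\b", "foo bar")]
print(boundaries)

# \Bcat: "cat" that is NOT at the beginning of a word (the one inside "concat").
inside = [m.span() for m in re.finditer(r"\Bcat", text)]
print(inside)
```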


### Advanced discussion 2: Raw string notation (r"text") 
Remember that we mentioned two styles of function calls:

```python
abMatch = re.match(r"ab", string)
# or
pattern = re.compile("ab")
abCompileMatch = pattern.match(string)    
``` 

So, what's the difference, and why do we need the raw string notation (r"text")? Besides compiling the pattern ahead of time, the two calls differ in the `r` prefix, and the need for it is caused by the **backslash character (`'\'`)**. In a regular expression, `'\'` either introduces a special form or makes a special character literal. However, this collides with Python's string literals, where the backslash is also the escape character. In short, to match **a** literal backslash `'\'`, one has to write **"\\\\"** as the pattern string, because the regular expression must be `\\`, and each backslash must be written as `\\` inside a regular Python string literal. The `r` at the start of the pattern string designates a Python "raw" string, which **passes backslashes through without change.**

Take a look at the following examples. Note that the first parameter of `re.match` is a pattern string, whereas the second parameter is the ordinary string being searched. `"\\"` as an ordinary string literal means a single `\`.

In [10]:
print re.match("\\\\", "\\").group()
print re.match(r"\\", "\\").group()
print re.match(r"\\", r"\\").group()
print re.match("\\\\", r"\\").group()
print re.match("\\\\w", "\\w").group()
print re.match(r"\\w", "\\w").group()
print re.match("\\\\section", "\\section").group()
print re.match(r"\\section", "\\section").group()

\
\
\
\
\w
\w
\section
\section
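To see what the `r` prefix actually changes, it helps to look at the strings themselves before they ever reach `re`. The sketch below compares a normal and a raw literal, and shows a classic pitfall: without the prefix, `"\b"` is Python's backspace character, not the regex word boundary.

```python
import re

normal = "\\\\"   # normal literal: each "\\" collapses to one backslash
raw = r"\\"       # raw literal: both backslashes are kept as typed

print(len(normal))
print(len(raw))
print(normal == raw)

# "\b" in a normal literal is the single backspace character (0x08) ...
print("\b" == "\x08")
# ... so this pattern looks for backspace + "foo" + backspace and finds nothing.
no_raw = re.search("\bfoo\b", "foo")
print(no_raw)

# With the raw prefix, the two characters '\' and 'b' reach the regex engine.
with_raw = re.search(r"\bfoo\b", "foo").group()
print(with_raw)
```

Writing every pattern as a raw string, even when it contains no backslash yet, avoids this class of surprise.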


### Advanced discussion 3: The "greedy" match concept

What does "greedy" mean in regular expressions? 

In [11]:
print re.match("ab*", "a").group()
print re.match("ab*", "ab").group()
print re.match("ab*", "abb").group()
print re.match("ab*", "abbbbbbbbbbb").group()

a
ab
abb
abbbbbbbbbbb


From the example, we can see that `'*'` causes the resulting RE to match 0 or more repetitions of the preceding RE, **as many repetitions as are possible**. This is the so-called greedy behavior in regular expressions. However, sometimes greedy is not what we want.

In [12]:
print "greedy: " + re.match(r"p.*q", "pbq c pdq").group() # what if we only want "pbq" to be matched?

greedy: pbq c pdq


In the above example, we just want "pbq"; however, the greedy `.*` gives us the whole "pbq c pdq". The following are the ways to opt out of greedy matching.

#### (1) `*?`, `+?`, `??`

In [13]:
print "non greedy: " + re.match(r"p.*?q", "pbq c pdq").group()
print "---"
print "greedy: " + re.match(r"p.+q", "pbq c pdq").group()
print "non greedy: " + re.match(r"p.+?q", "pbq c pdq").group()
print "---"
print "greedy: " + re.match(r"p?q", "pq").group()
print "non greedy: " + re.match(r"p??q", "pq").group()
print "greedy: " + re.match(r"p?q", "q").group()
print "non greedy: " + re.match(r"p??q", "q").group()

non greedy: pbq
---
greedy: pbq c pdq
non greedy: pbq
---
greedy: pq
non greedy: pq
greedy: q
non greedy: q


In the last case, `?` and `??` give the same result, but they differ in how the regular expression engine runs.
For the greedy match `re.match("p?q", "pq")`, the engine first tries to match `p` once and only falls back to zero times if that fails. With `re.match("p??q", "pq")`, it first tries zero times (which fails, because "q" does not match the leading "p") and then falls back to matching `p` once. A similar try-in-order concept also applies to the `'|'` (OR) operator.

As the target string is scanned, REs separated by `'|'` are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. Therefore, the documentation says that the **`'|'` operator is never greedy**. 
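The order in which the engine tries the alternatives is invisible when the overall match is the same, but it can be made visible with capturing groups. In this sketch (made-up patterns), two optional groups compete for a single "a", and two alternations differ only in the order of their branches.

```python
import re

# Greedy '?': the first group grabs the "a", leaving nothing for the second.
greedy = re.match(r"(a?)(a?)", "a").groups()
# Lazy '??': the first group tries "zero times" first, so the second gets the "a".
lazy = re.match(r"(a??)(a?)", "a").groups()
print(greedy)
print(lazy)

# '|' is tried left to right and stops at the first branch that works,
# even when a later branch would give a longer match.
first_wins = re.match(r"a|ab", "ab").group()
longer_first = re.match(r"ab|a", "ab").group()
print(first_wins)
print(longer_first)
```

When the branches of an alternation overlap like this, putting the longer branch first is a simple way to get the longer match.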

#### (2) `{m,n}?`

In [14]:
print "greedy: " + re.match(r"a{3,5}", "aaa").group()
print "non greedy: " + re.match(r"a{3,5}?", "aaa").group()
print "greedy: " + re.match(r"a{3,5}", "aaaaa").group()
print "non greedy: " + re.match(r"a{3,5}?", "aaaaa").group()

greedy: aaa
non greedy: aaa
greedy: aaaaa
non greedy: aaa
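Besides the lazy quantifiers, another common way to avoid over-matching is to replace `.` with a complemented set that cannot run past the closing delimiter. The sketch below compares the three variants on a made-up, tag-like string; it is only meant to illustrate greediness, not to suggest a way of parsing real markup.

```python
import re

markup = "<b>bold</b> and <i>italic</i>"

greedy_tags = re.findall(r"<.*>", markup)      # runs from the first '<' to the last '>'
lazy_tags = re.findall(r"<.*?>", markup)       # stops at the nearest '>'
negated_tags = re.findall(r"<[^>]*>", markup)  # cannot cross a '>' in the first place

print(greedy_tags)
print(lazy_tags)
print(negated_tags)
```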


### Advanced discussion 4:  `(?...)`, the extension notation
`(?...)` is an extension notation. The **first character after the `'?'` determines** what the meaning and further syntax of the construct is.

#### (1) `(?:...)`
* A non-capturing version of regular parentheses.

In [15]:
print re.match(r"(ab)(cd)(e)", "abcde").group()
print re.match(r"(ab)(cd)(e)", "abcde").groups()
print re.match(r"(ab)(?:cd)(e)", "abcde").group()
print re.match(r"(ab)(?:cd)(e)", "abcde").groups()

abcde
('ab', 'cd', 'e')
abcde
('ab', 'e')
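The difference between capturing and non-capturing parentheses matters most with `re.findall()` and `re.split()`, whose return values change when the pattern contains capturing groups. A small sketch with made-up inputs:

```python
import re

# With a capturing group, findall returns the group's text (its last repetition),
# not the whole match.
captured = re.findall(r"(ab)+", "ababab cd ab")
# With a non-capturing group, findall returns the whole match.
whole = re.findall(r"(?:ab)+", "ababab cd ab")
print(captured)
print(whole)

# With a capturing group, split keeps the separators in the result list.
with_separators = re.split(r"(-)", "a-b-c")
without_separators = re.split(r"(?:-)", "a-b-c")
print(with_separators)
print(without_separators)
```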


#### (2) `(?P<name>...)` and `(?P=name)`
* `(?P<name>...)`
    * The substring matched by the group is accessible via the symbolic group name `name`.
* `(?P=name)`
    * A backreference to a named group; it matches whatever text was matched by the earlier group named `name`.
    * There is another way to write a backreference: `\number`, which is shown below.

In [16]:
# example 1: set and access the symbolic group name `foo` and `bar`
match = re.match(r"(?P<foo>ab)(c)(?P<bar>de)", "abcde")
print match.group()
print match.groupdict()
print match.groupdict()['foo']
print match.groupdict()['bar']
print match.group(1)
print match.group(2)
print match.group(3)

abcde
{'foo': 'ab', 'bar': 'de'}
ab
de
ab
c
de


In [17]:
# example 2: use the back reference by symbolic group name
print re.match(r"(?P<foo>ab)(c)(?P=foo)", "abcde")
print "---"
match = re.match(r"(?P<foo>ab)(c)(?P=foo)", "abcab")
print match.groupdict() 
print match.groups() # the backreference (?P=foo) is not a group itself, so it does not show up here
print match.span(1) 

None
---
{'foo': 'ab'}
('ab', 'c')
(0, 2)


In [18]:
# example 3: use the back reference by \number
match = re.match(r"(ab)(c)(\1)", "abcab")
print match.groups()
match = re.match(r"(ab)(c)(\1)(\2)", "abcabc")
print match.groups()

('ab', 'c', 'ab')
('ab', 'c', 'ab', 'c')
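Groups and backreferences are also useful on the replacement side of `re.sub()`, where `\g<name>` and `\number` refer to what the groups matched. The sketch below uses made-up group names (`first`, `last`) and sample strings to swap two words and to find a doubled word.

```python
import re

# Named groups can be accessed directly by name ...
name_match = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Ada Lovelace")
print(name_match.group("first"))

# ... and reused in the replacement string with \g<name>
swapped = re.sub(r"(?P<first>\w+) (?P<last>\w+)", r"\g<last>, \g<first>", "Ada Lovelace")
print(swapped)

# Numbered groups work the same way with \1, \2, ...
swapped_numbered = re.sub(r"(\w+) (\w+)", r"\2 \1", "hello world")
print(swapped_numbered)

# Inside the pattern, \1 must repeat the exact text group 1 matched:
# here it finds a word that is immediately repeated.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test").group()
print(doubled)
```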



## Summary and references

I've struggled a lot with regular expressions before. By rearranging, covering, and testing the Python `re` documentation in this tutorial, I gained a lot of confidence in using them. I hope the reader will also find it helpful!

* [Python Document: 7.2. re — Regular expression operations](https://docs.python.org/2/library/re.html)
* [Regular Expression HOWTO](https://docs.python.org/2/howto/regex.html#regex-howto)
* [Regular Expression Info](http://www.regular-expressions.info/)