In [1]:
import re

'def'

#### String Literal
* Case-sensitive

In [19]:
re.findall('cat', 'catatcatCat')

['cat', 'cat']
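As a small illustration of the case-sensitivity rule, the sketch below runs the same literal with and without the standard `re.IGNORECASE` flag. The variable names are only for illustration.

```python
import re

text = 'catatcatCat'

# Default: the literal 'cat' is matched case-sensitively, so 'Cat' is skipped.
case_sensitive = re.findall('cat', text)

# re.IGNORECASE lets the same literal match 'Cat' as well.
case_insensitive = re.findall('cat', text, flags=re.IGNORECASE)

print(case_sensitive)    # ['cat', 'cat']
print(case_insensitive)  # ['cat', 'cat', 'Cat']
```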

#### Metacharacters
* To use a metacharacter as a literal, escape it with `\`.
* As a convention, always escape literal braces: `\{` and `\}`.
* Inside a character class, escape `^`, `]`, `\` and `-` with `\` to include them as literals (`\\` is a literal backslash).

| Metacharacters | Meaning                     |
| :------------- | :-------------------------- |
| `[]`           | Character class             |
| `\`            | Escape                      |
| `^`            | Negate; anchor string start |
| `$`            | Anchor string end           |
| `.`            | Match any                   |
| `\|`           | Alternative                 |
| `?`            | Optional                    |
| `*`            | Repeat 0 or more times      |
| `+`            | Repeat 1 or more times      |
| `{}`           | Special repetition operator |
| `()`           | Grouping                    |
| `-`            | Range                       |
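To make the escaping rule concrete, here is a minimal sketch that compares an unescaped `+` with an escaped one, and uses the standard `re.escape` helper to escape a whole literal. The sample strings are made up for the example.

```python
import re

sample = '11 and 1+1'

# Unescaped, '+' is a quantifier: "one or more 1s, followed by a 1".
unescaped = re.findall(r'1+1', sample)

# Escaped, '\+' is a literal plus sign.
escaped = re.findall(r'1\+1', sample)

# re.escape adds the backslashes for you, so '$' and '.' are taken literally.
pattern = re.escape('$3.50')
found = re.findall(pattern, 'costs $3.50, not $3150')

print(unescaped)  # ['11']
print(escaped)    # ['1+1']
print(found)      # ['$3.50']
```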

#### Character set
* Enclose acceptable characters in square brackets: `[]`

In [22]:
# a single character is matched
re.findall('gr[ae]y', 'grey or gray')

['grey', 'gray']

In [23]:
# match a range of characters
re.findall('[0-9]', '5 dollars 20 cents')

['5', '2', '0']

In [31]:
# can combine single match and range match
# match hex digit 0-9, a-f, A-F, x or X
re.findall('[0-9a-fxA-FX]', 'memory 0x01AE3')

['e', '0', 'x', '0', '1', 'A', 'E', '3']

#### Negated Character
* Typing a caret after the opening square bracket negates the character class.
* `q[^u]` does not mean "a q not followed by a u". It means "a q followed by a character that is not a u".

In [32]:
re.findall('m[^o]', 'memory ram')

['me']
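The difference between "not followed by a u" and "followed by a non-u character" shows up at the end of a string. A short sketch with made-up sample strings:

```python
import re

# 'q[^u]' needs some character after the q, so a q at the very end cannot match.
at_end = re.findall(r'q[^u]', 'Iraq')

# Here the character after q is a space, and that space is part of the match.
before_space = re.findall(r'q[^u]', 'Iraq is a country')

# A q followed by u never matches.
before_u = re.findall(r'q[^u]', 'question')

print(at_end)        # []
print(before_space)  # ['q ']
print(before_u)      # []
```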

#### Shorthand
* `\d`: digit.
* `\w`: word character.
* `\s`: whitespace character.
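* `\D`: non-digit, same as `[^\d]`.
* `\W`: non-word character, same as `[^\w]`.
* `\S`: non-whitespace character, same as `[^\s]`.

Note: `[\D\S]` matches any character, because every character is either not a digit or not whitespace; `[^\d\s]` matches only characters that are neither a digit nor whitespace.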
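Combining negated shorthands inside a class is easy to get wrong, so here is a quick check. It contrasts `[\D\S]` (non-digit OR non-whitespace) with `[^\d\s]` (neither digit nor whitespace); the sample string is an arbitrary choice.

```python
import re

sample = 'a1 \n_'

# Every character is either a non-digit or a non-whitespace character,
# so this class matches all of them, including the digit and the newline.
either = re.findall(r'[\D\S]', sample)

# This class excludes both digits and whitespace.
neither = re.findall(r'[^\d\s]', sample)

print(either)   # ['a', '1', ' ', '\n', '_']
print(neither)  # ['a', '_']
```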

In [33]:
re.findall(r'\d', 'memory 1024')

['1', '0', '2', '4']

In [34]:
re.findall(r'\w', 'memory 1024')

['m', 'e', 'm', 'o', 'r', 'y', '1', '0', '2', '4']

In [35]:
re.findall(r'\s', 'memory 1024')

[' ']

In [41]:
re.findall(r'\s', 'memory\t1024\n')

['\t', '\n']

In [42]:
re.findall(r'\D', 'memory 1024')

['m', 'e', 'm', 'o', 'r', 'y', ' ']

In [43]:
re.findall(r'\W', 'memory 1024')

[' ']

In [44]:
re.findall(r'\S', 'memory 1024')

['m', 'e', 'm', 'o', 'r', 'y', '1', '0', '2', '4']

#### Repeating Character
* A quantifier after a character class repeats the entire class, not just the character it matched first: `[0-9]+` matches `837` as well as `222`.

In [47]:
re.findall('o', 'Gooooogle')

['o', 'o', 'o', 'o', 'o']

In [48]:
re.findall('o+', 'Gooooogle')

['ooooo']

In [50]:
re.findall('o*', 'Gooooogle')

['', 'ooooo', '', '', '', '']

In [51]:
re.findall('o?', 'Gooooogle')

['', 'o', 'o', 'o', 'o', 'o', '', '', '', '']

In [57]:
re.findall(r'[\do]+', 'Gooooogle 101')

['ooooo', '101']
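To see that a quantifier repeats the class rather than the matched character, compare it with a backreference, which does repeat the matched character. This is a sketch with an arbitrary sample string; `finditer` is used so the full match is returned instead of the captured group.

```python
import re

text = '837 222 9'

# The class is repeated: each digit may differ from the previous one.
any_digits = re.findall(r'[0-9]+', text)

# The backreference \1 repeats the exact digit captured by the group.
same_digit = [m.group(0) for m in re.finditer(r'([0-9])\1+', text)]

print(any_digits)  # ['837', '222', '9']
print(same_digit)  # ['222']
```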

#### Dot Matches (Almost) Any Character
* The dot is short for the negated character class `[^\n]`.
* It matches a single character, without caring what that character is.
* Use with extreme care:
    - `\d\d.\d\d.\d\d` is a bad date pattern: the dot accepts any separator, so it also matches `02512703`.
    - `\d\d[- /.]\d\d[- /.]\d\d` limits the separator to a hyphen, space, slash or dot.
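The two date patterns above can be compared directly. In this sketch the candidate strings are invented, and `re.fullmatch` is used so the whole string has to match.

```python
import re

loose = r'\d\d.\d\d.\d\d'               # dot accepts any separator
strict = r'\d\d[- /.]\d\d[- /.]\d\d'    # only hyphen, space, slash or dot

candidates = ['02/12/03', '02-12-03', '02512703']

loose_hits = [s for s in candidates if re.fullmatch(loose, s)]
strict_hits = [s for s in candidates if re.fullmatch(strict, s)]

print(loose_hits)   # ['02/12/03', '02-12-03', '02512703']
print(strict_hits)  # ['02/12/03', '02-12-03']
```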

#### String Anchor
* `^` (outside of a character class) matches the position before the first character in the string.
* `$` matches right after the last character in the string.

In [58]:
re.findall('^H', 'Hello Hell')

['H']

In [59]:
# 'e' occurs in the string, but not at the start
re.findall('^e', 'Hello Hell')

[]

In [63]:
re.findall('ll$', 'Hello Hell')

['ll']

In [66]:
# by default, a new line is NOT treated as the start of a new string
re.findall('^H', 'Hello\nHell')

['H']
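If line-by-line anchoring is wanted, Python's standard `re.MULTILINE` flag makes `^` and `$` match at line breaks as well. A minimal sketch:

```python
import re

text = 'Hello\nHell'

# Default: ^ matches only at the start of the whole string,
# and $ only at its end.
default_start = re.findall(r'^H', text)
default_end = re.findall(r'\w+$', text)

# re.MULTILINE: ^ and $ also match right after and right before each newline.
multiline_start = re.findall(r'^H', text, flags=re.MULTILINE)
multiline_end = re.findall(r'\w+$', text, flags=re.MULTILINE)

print(default_start)    # ['H']
print(default_end)      # ['Hell']
print(multiline_start)  # ['H', 'H']
print(multiline_end)    # ['Hello', 'Hell']
```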

#### Word Boundary Anchor
* `\b` matches before the first character in the string, if the first character is a word character.
* After the last character in the string, if the last character is a word character.
* Between a word character and a non-word character (e.g. a space) that immediately follows it.
* Between a non-word character and a word character that immediately follows it.
* `\B` is the negated version of `\b`: it matches at every position where `\b` does not.

`\bis\b` matches only the standalone word `is`, not the `is` inside `This` or `island`:
```
This island is beautiful
            ^^
```

In [69]:
# use a raw string: in a normal string literal, '\b' is a backspace character
re.findall(r'\bis\b', 'This island is beautiful')

['is']
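In Python, `\b` inside an ordinary string literal is the backspace character, so word-boundary patterns need a raw string. The sketch below shows the difference and uses `finditer` to report where `\b` and `\B` allow a match; the variable names are illustrative.

```python
import re

text = 'This island is beautiful'

# Non-raw string: the pattern contains two backspace characters, so no match.
backspace = re.findall('\bis\b', text)

# Raw string: \b is a word boundary; only the standalone 'is' matches.
whole_word = [m.span() for m in re.finditer(r'\bis\b', text)]

# \B requires that there is NO boundary before 'is': only the 'is' in 'This'.
inside_word = [m.span() for m in re.finditer(r'\Bis', text)]

print(backspace)    # []
print(whole_word)   # [(12, 14)]
print(inside_word)  # [(2, 4)]
```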

#### Alternation with The Vertical Bar or Pipe Symbol
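* Lowest precedence of all operators: everything to the left of `|` is one alternative, everything to the right is another.
* Order matters: the engine tries alternatives left to right and stops at the first one that matches.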

In [70]:
re.findall('GetValue|Get|SetValue|Set', 'SetValue')

['SetValue']

In [71]:
re.findall('GetValue|Get|Set|SetValue', 'SetValue')

['Set']

In [76]:
# with a capturing group, findall returns the group's text, not the whole match
re.findall('GetValue|Get|Set(Value)?', 'SetValue')

['Value']
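Because the first matching alternative wins, a short alternative listed first can hide a longer one. Two common workarounds, sketched here under the assumption that whole words are wanted, are word boundaries and an optional non-capturing suffix.

```python
import re

text = 'SetValue'

# 'Set' is tried before 'SetValue' and matches, so the engine stops there.
first_wins = re.findall(r'Get|GetValue|Set|SetValue', text)

# Word boundaries reject 'Set' (no boundary before 'V'), forcing 'SetValue'.
bounded = re.findall(r'\b(?:Get|GetValue|Set|SetValue)\b', text)

# An optional non-capturing suffix avoids listing both forms.
optional_suffix = re.findall(r'(?:Get|Set)(?:Value)?', text)

print(first_wins)       # ['Set']
print(bounded)          # ['SetValue']
print(optional_suffix)  # ['SetValue']
```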

#### Optional Character
* `colou?r` matches color and colour
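* Greedy: the engine first tries to match the optional part, and skips it only if the match would otherwise fail.
* With a capturing group, `re.findall` returns the group's text rather than the whole match, so `Nov(ember)?` yields `['ember', '']`; use a non-capturing group `(?:...)` to get the full matches.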

In [77]:
re.findall('colou?r', 'This colour is colorful')

['colour', 'color']

In [78]:
re.findall('Nov(ember)?', 'November is Nov')

['ember', '']
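The output above comes from how `re.findall` treats capturing groups, not from the `?` itself. A short sketch of two ways to get the whole matches instead:

```python
import re

text = 'November is Nov'

# One capturing group: findall returns what the group captured.
captured = re.findall(r'Nov(ember)?', text)

# Non-capturing group: findall returns the whole match.
non_capturing = re.findall(r'Nov(?:ember)?', text)

# finditer gives match objects; group(0) is always the whole match.
whole = [m.group(0) for m in re.finditer(r'Nov(ember)?', text)]

print(captured)       # ['ember', '']
print(non_capturing)  # ['November', 'Nov']
print(whole)          # ['November', 'Nov']
```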

#### Repetition
* Limit how many times: `{min,max}`; exactly N times: `{N}`
* `+`: once or more, same as `{1,}`
* `*`: zero or more, same as `{0,}`
* Greedy: the quantifier takes as many characters as it can and only gives some back when the rest of the pattern fails to match.
    - `<.+>` matches `<EM>first</EM>` in `This is a <EM>first</EM> test`, not just `<EM>`
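A brief sketch of the brace quantifiers on an invented list of numbers. It also shows that Python's `re` module does not accept a space inside the braces: `{1, }` is read as literal text rather than as a quantifier.

```python
import re

text = '7 42 999 1000 9999 10000'

# {3} repeats exactly three times: a four-digit number from 1000 to 9999.
four_digits = re.findall(r'\b[1-9][0-9]{3}\b', text)

# {2,3} repeats two or three times.
two_or_three = re.findall(r'\b[0-9]{2,3}\b', text)

# {1,} behaves like +. With a space inside the braces the text '{1, }'
# is matched literally, so nothing is found in 'Gooooogle'.
open_ended = re.findall(r'o{1,}', 'Gooooogle')
with_space = re.findall(r'o{1, }', 'Gooooogle')

print(four_digits)   # ['1000', '9999']
print(two_or_three)  # ['42', '999']
print(open_ended)    # ['ooooo']
print(with_space)    # []
```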

In [81]:
# must start with letter
re.findall('<[A-Za-z][A-Za-z0-9]*>', '<h2></h2>')

['<h2>']

In [82]:
re.findall('<[A-Za-z][A-Za-z0-9]*>', '<span></span>')

['<span>']

![alt-text](assets/backtrack.png)

In [83]:
# .+ first runs to the end of the string: <EM>first</EM> test
# the > then fails to match, so the engine backtracks one character at a time
# until > matches, which gives <EM>first</EM>
re.findall('<.+>', 'This is a <EM>first</EM> test')

['<EM>first</EM>']

##### Lazy `?` instead of greedy
* Adding `?` after a quantifier (`+?`, `*?`) tells the regex engine to repeat the dot as few times as possible.
* Backtracking then forces the lazy plus to expand rather than reduce its reach.

In [87]:
re.findall('<.+?>', 'This is a <EM>first</EM> test')

['<EM>', '</EM>']

In [88]:
re.findall('<[^>]+>', 'This is a <EM>first</EM> test')

['<EM>', '</EM>']

#### Grouping & Back Reference
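* `\1` refers to the text matched by the first capturing group. In Python, write the pattern as a raw string so that `\1` is not read as an octal escape.
* To group without capturing (so no backreference is created), use `(?:...)`: `Set(?:Value)?`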

In [92]:
# raw string, so \1 is a backreference; findall returns the captured group
re.findall(r'\d\d(\d\d)-\1-\1', '2008-08-08')

['08']
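Since `re.findall` only reports the captured group, `re.search` is a convenient way to inspect both the whole match and the group. The sketch below also applies a backreference to a matching open/close tag pair; the tag pattern and sample strings are illustrative assumptions, not a general HTML parser.

```python
import re

# \1 must repeat exactly what the first group captured ('08').
match = re.search(r'\d\d(\d\d)-\1-\1', '2008-08-08')
whole_date = match.group(0)
year_suffix = match.group(1)

# A different month breaks the backreference, so there is no match.
no_match = re.search(r'\d\d(\d\d)-\1-\1', '2008-09-08')

# The closing tag must repeat the opening tag's name.
tag_pair = r'<([A-Za-z][A-Za-z0-9]*)>.*?</\1>'
paired = re.search(tag_pair, 'This is a <EM>first</EM> test')
mismatched = re.search(tag_pair, '<B>bold</I>')

print(whole_date)       # 2008-08-08
print(year_suffix)      # 08
print(no_match)         # None
print(paired.group(0))  # <EM>first</EM>
print(mismatched)       # None
```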