# Simple Patterns
## Matching characters
* Most letters and characters simply match themselves.
* There are exceptions to this rule: some characters are special metacharacters and don’t match themselves.

Here’s a complete list of the metacharacters:
```
. ^ $ * + ? { } [ ] \ | ( )
```

```
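[ ] are used to specify a character class, which is a set of characters that you wish to match.
1. [abc] is the same as [a-c]: it matches a, b, or c.
2. [akm$] matches a, k, m, or $; the $ loses its special meaning inside a class.
3. [^5] matches any character except 5 (a leading ^ complements the set).
4. [5^] matches 5 or ^ (outside the first position, ^ has no special meaning).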
```
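
A minimal sketch of the character-class rules above, using `re.findall()` to list every character that a class matches. The sample strings are made up for illustration.

```python
import re

# [abc] and [a-c] describe the same set of characters.
listed = re.findall(r'[abc]', 'cabbage')
ranged = re.findall(r'[a-c]', 'cabbage')
print(listed == ranged)   # True

# $ is an ordinary character inside a class.
dollar = re.findall(r'[akm$]', 'make $5')
print(dollar)             # ['m', 'a', 'k', '$']

# A leading ^ complements the set; anywhere else it is a literal caret.
not_five = re.findall(r'[^5]', '1525')
five_or_caret = re.findall(r'[5^]', '5^2')
print(not_five)           # ['1', '2']
print(five_or_caret)      # ['5', '^']
```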

```
\ As in Python string literals, the backslash can be followed by various characters to signal various special sequences. It’s also used to escape the metacharacters so you can still match them in patterns; for example, \[ matches a literal [ and \\ matches a literal backslash.
```

```
\d = [0-9]
\D = [^0-9]
\s = [ \t\n\r\f\v]
\S = [^ \t\n\r\f\v]
\w = [a-zA-Z0-9_]
\W = [^a-zA-Z0-9_]
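. = any character except a newline
In the alternate mode re.DOTALL, . also matches a newline.
The sequences above can be included inside a character class; for example, [\s,.] matches any whitespace character, ',' or '.'.
For str patterns the equivalences above are exact only with the re.ASCII flag; by default \d, \s and \w also match the corresponding Unicode characters.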
```
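
The shorthand sequences and the behaviour of `.` can be checked with a short sketch. The sample text is an assumption chosen so that every class has something to match.

```python
import re

text = 'Room 42\tis_open\n'
digits = re.findall(r'\d', text)      # ['4', '2']
words = re.findall(r'\w+', text)      # ['Room', '42', 'is_open']
spaces = re.findall(r'\s', text)      # [' ', '\t', '\n']

# Shorthand sequences also work inside a character class.
separators = re.findall(r'[\s,.]', 'a, b.c')   # [',', ' ', '.']

# By default . stops at a newline; re.DOTALL lifts that restriction.
default_dot = re.match(r'a.b', 'a\nb')             # None
dotall_dot = re.match(r'a.b', 'a\nb', re.DOTALL)   # matches 'a\nb'

print(digits, words, spaces, separators)
print(default_dot, dotall_dot.group() == 'a\nb')
```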

## Repeating things

```
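* specifies that the previous character can be matched zero or more times.
ca*t = ct or cat or caaaaaat

+ specifies that the previous character must be matched one or more times.
? specifies that the previous character can be matched either zero times or once.
{m,n} specifies that the previous character must be matched at least m and at most n times (written without a space after the comma).
a/{1,3}b = a/b or a//b or a///b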
```
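
The four qualifiers can be compared side by side. This is a sketch only: `accepts` is a hypothetical helper written for this example, not part of the `re` module, and the candidate strings are made up.

```python
import re

def accepts(pattern, candidates):
    """Hypothetical helper: return the candidates the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

star = accepts(r'ca*t', ['ct', 'cat', 'caaat', 'cbt'])
plus = accepts(r'ca+t', ['ct', 'cat', 'caaat'])
optional = accepts(r'home-?brew', ['homebrew', 'home-brew', 'home--brew'])
bounded = accepts(r'a/{1,3}b', ['ab', 'a/b', 'a//b', 'a///b', 'a////b'])

print(star)      # ['ct', 'cat', 'caaat']
print(plus)      # ['cat', 'caaat']
print(optional)  # ['homebrew', 'home-brew']
print(bounded)   # ['a/b', 'a//b', 'a///b']
```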

# Using Regular Expressions
## Compile RE

In [7]:
import re
p = re.compile('ab*')
p
# re.compile() also accepts an optional flags argument, used to enable various special features and syntax variations.
p = re.compile('ab*', re.IGNORECASE)
p

re.compile(r'ab*', re.IGNORECASE|re.UNICODE)

The RE is passed to re.compile() as a string. REs are handled as strings because regular expressions aren’t part of the core Python language, and no special syntax was created for expressing them. 
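
As a small illustration of what the flags argument changes, the sketch below compiles the same pattern string twice and compares the results. The variable names are assumptions made for this example.

```python
import re

case_sensitive = re.compile(r'ab*')
case_insensitive = re.compile(r'ab*', re.IGNORECASE)

print(case_sensitive.match('ABBB'))            # None
print(case_insensitive.match('ABBB').group())  # ABBB

# The pattern object remembers the string and flags it was built from.
print(case_insensitive.pattern)                      # ab*
print(bool(case_insensitive.flags & re.IGNORECASE))  # True
```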

## The Backslash Plague

As stated earlier, regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This conflicts with Python’s usage of the same character for the same purpose in string literals.

|Characters | Stage |
| -------- | ----- |
| `\section` | Text string to be matched |
| `\\section` | Escaped backslash for re.compile() |
| `"\\\\section"` | Escaped backslashes for a string literal |

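In short, to match a literal backslash, one has to write `'\\\\'` as the RE string, because the regular expression must be `\\`, and each backslash must be expressed as `\\` inside a regular Python string literal.
The solution is to use Python’s raw string notation: backslashes are not handled in any special way in a string literal prefixed with `r`, so `r"\n"` is a two-character string containing `\` and `n`, while `"\n"` is a one-character string containing a newline.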
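
To make the difference concrete, the following sketch compiles the same RE from a regular string literal and from a raw string literal. The sample text is an assumption for illustration.

```python
import re

text = r'\section{Intro}'          # the text contains one literal backslash

plain = re.compile('\\\\section')  # regular string literal: four backslashes
raw = re.compile(r'\\section')     # raw string literal: two backslashes

print(plain.pattern == raw.pattern)   # True
print(raw.search(text).group())       # \section
print(len('\n'), len(r'\n'))          # 1 2
```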

## Performing Matches
match() and search() return None if no match can be found. If they’re successful, a match object instance is returned, containing information about the match: where it starts and ends, the substring it matched, and more.

In [20]:
import re
p = re.compile(r'[a-z]+')
print(p.match(""))
m = p.match('tempo')
print(m)
print(m.group(), m.start(), m.end())  # group() returns the substring that was matched by the RE
m = p.search('::: message')  # search() scans through the string, so the match may not start at position zero
print(m) 

None
<re.Match object; span=(0, 5), match='tempo'>
tempo 0 5
<re.Match object; span=(4, 11), match='message'>


In [22]:
p = re.compile(r'\d+')
matches = p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
for match in matches:
    print(match)

12
11
10


## Module-level Functions
You don’t have to create a pattern object and call its methods; the re module also provides top-level functions called match(), search(), findall(), sub(), and so forth. 
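
A brief sketch of the equivalence: the module-level functions take the RE string as their first argument and otherwise behave like the pattern methods of the same name. The sample text and variable names are assumptions.

```python
import re

text = '12 drummers drumming, 11 pipers piping'
pattern = re.compile(r'\d+')

via_object = pattern.findall(text)
via_module = re.findall(r'\d+', text)
print(via_object == via_module)   # True

replaced = re.sub(r'\d+', '#', text)
print(replaced)                   # # drummers drumming, # pipers piping
```

Compiling once is mostly a convenience when the same RE is used many times, for example inside a loop.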

In [5]:
import re

print(re.match(r'From\s+', 'Fromage amk'))
re.match(r'From\s+', 'From amk Thu May 14 19:12:10 1998')  

None


<re.Match object; span=(0, 5), match='From '>

## Compilation Flags 
Flag | Meaning
:-: | :-:
ASCII, A | Makes several escapes like \w, \b, \s and \d match only on ASCII characters with the respective property.
DOTALL, S | Make . match any character, including newlines.
IGNORECASE, I | Do case-insensitive matches.
LOCALE, L | Do a locale-aware match.
MULTILINE, M | Multi-line matching, affecting ^ and $.
VERBOSE, X (for "extended") | Enable verbose REs, which can be organized more cleanly and understandably.

In [8]:
charref = re.compile(r"""
 &[#]                # Start of a numeric entity reference
 (
     0[0-7]+         # Octal form
   | [0-9]+          # Decimal form
   | x[0-9a-fA-F]+   # Hexadecimal form
 )
 ;                   # Trailing semicolon
""", re.VERBOSE)

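
The sketch below exercises a few of the flags. It assumes a one-line spelling of the numeric-entity pattern (named `charref_compact` here for illustration) to show what the verbose form is equivalent to. In VERBOSE mode, whitespace in the pattern is ignored and `#` starts a comment unless it is escaped or inside a character class, which is presumably why the verbose pattern writes `[#]`.

```python
import re

# Short and long flag names are interchangeable.
print(re.S == re.DOTALL, re.I == re.IGNORECASE)   # True True

# Flags are combined with |.
combined = re.findall(r'^a.', 'Ab\nac', re.I | re.M)
print(combined)                                   # ['Ab', 'ac']

# Compact spelling of the numeric-entity pattern, without re.VERBOSE.
charref_compact = re.compile(r'&#(0[0-7]+|[0-9]+|x[0-9a-fA-F]+);')

bodies = []
for sample in ['&#65;', '&#0101;', '&#x41;', '&amp;']:
    m = charref_compact.search(sample)
    bodies.append(m.group(1) if m else None)
print(bodies)                                     # ['65', '0101', 'x41', None]
```
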
# More Pattern Power

## More Metacharacters

```|``` Alternation, or the "or" operator. A|B will match any string that matches either A or B. To match a literal '|', use ```\|```, or enclose it inside a character class, as in ```[|]```.
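
A short sketch of alternation and of the two ways to match a literal `|`. The sample strings are assumptions for illustration.

```python
import re

either = re.findall(r'Crow|Servo', 'Crow and Servo')
print(either)        # ['Crow', 'Servo']

# | has very low precedence: the pattern means 'Crow' or 'Servo',
# not 'Cro', then 'w' or 'S', then 'ervo'.
mixed = re.fullmatch(r'Crow|Servo', 'CroServo')
print(mixed)         # None

escaped = re.findall(r'a\|b', 'a|b ab')
in_class = re.findall(r'a[|]b', 'a|b ab')
print(escaped, in_class)   # ['a|b'] ['a|b']
```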

```^``` Matches at the beginning of lines. Unless the MULTILINE flag has been set, this will only match at the beginning of the string. In MULTILINE mode, this also matches immediately after each newline within the string.

In [9]:
print(re.search('^From', 'From Here to Eternity'))  

print(re.search('^From', 'Reciting From Memory'))

<re.Match object; span=(0, 4), match='From'>
None


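```$``` Matches at the end of a line. Unless the MULTILINE flag has been set, this matches only at the end of the string or just before a newline at the end of the string. In MULTILINE mode, it also matches just before each newline within the string.

```\A``` Matches only at the start of the string. Outside MULTILINE mode, \A and ^ are effectively the same; in MULTILINE mode, ^ also matches after each newline, whereas \A still matches only at the start of the string.

```\Z``` Matches only at the end of the string.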
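
The difference between the line anchors and the string anchors shows up only in MULTILINE mode. A minimal sketch, assuming a two-line sample string:

```python
import re

text = 'first line\nsecond line'

start_default = re.findall(r'^\w+', text)                # ['first']
start_multi = re.findall(r'^\w+', text, re.MULTILINE)    # ['first', 'second']
end_default = re.findall(r'\w+$', text)                  # ['line']
end_multi = re.findall(r'\w+$', text, re.MULTILINE)      # ['line', 'line']

# \A and \Z are unaffected by MULTILINE.
start_string = re.findall(r'\A\w+', text, re.MULTILINE)  # ['first']
end_string = re.findall(r'\w+\Z', text, re.MULTILINE)    # ['line']

print(start_default, start_multi, end_default, end_multi)
print(start_string, end_string)
```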

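```\b``` Word boundary. This is a zero-width assertion that matches only at the beginning or end of a word, where a word is a sequence of alphanumeric characters.

```\B``` The opposite of \b: matches only when the current position is not at a word boundary.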

In [11]:
p = re.compile(r'\bclass\b')
print(p.search('no class at all'))

print(p.search('the declassified algorithm'))

print(p.search('one subclass is'))

<re.Match object; span=(3, 8), match='class'>
None
None
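
For contrast, here is a sketch of `\B` on the same sample strings, followed by the raw-string pitfall: in a regular string literal, `'\b'` is Python's backspace character, so the pattern silently stops meaning "word boundary". The variable names are assumptions.

```python
import re

inside = re.compile(r'\Bclass\B')
embedded = inside.search('the declassified algorithm')
print(embedded.span())                      # (6, 11)
print(inside.search('no class at all'))     # None
print(inside.search('one subclass is'))     # None

# Without the r prefix, \b is a backspace character, not a word boundary.
backspace = re.compile('\bclass\b')
missed = backspace.search('no class at all')
literal = backspace.search('\b' + 'class' + '\b')
print(missed)                               # None
print(literal.span())                       # (0, 7)
```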


## Grouping
Frequently you need to obtain more information than just whether the RE matched or not.

```'('``` and ```')'``` group together the expressions contained inside them, and you can repeat the contents of a group with a repeating qualifier, such as ```*```, ```+```, ```?```, or ```{m,n}```.

In [13]:
p = re.compile('(ab)*')
print(p.match('ababababab').span())

(0, 10)


Groups are numbered starting with 0. Group 0 is always present; it’s the whole RE, so match object methods all have group 0 as their default argument. 

In [17]:
p = re.compile('(a)b')
m = p.match('ab')
print(m.group())
print(m.group(0))

ab
ab


Subgroups are numbered from left to right, from 1 upward. Groups can be nested; to determine the number, just count the opening parenthesis characters, going from left to right.

In [22]:
p = re.compile('(a(b)c)d')
m = p.match('abcd')
print(m.group(2))
print(m.group(1, 2))  # group() accepts several group numbers at once and returns a tuple of the matching substrings
print(m.groups())

b
('abc', 'b')
('abc', 'b')
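
Like `group()`, the `start()`, `end()`, and `span()` methods of a match object accept a group number. The sketch below uses a made-up pattern for a `start-end` page range to show how the pieces line up.

```python
import re

p = re.compile(r'(\d+)-(\d+)')   # illustrative pattern: two numbers joined by '-'
m = p.search('pages 12-15')

whole = m.group(0)         # the entire match
parts = m.groups()         # every subgroup, from group 1 upward
first_span = m.span(1)     # position of group 1 in the searched string
second_start = m.start(2)  # start of group 2

print(whole)         # 12-15
print(parts)         # ('12', '15')
print(first_span)    # (6, 8)
print(second_start)  # 9
```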
