## Regular Expressions (aka: regex)
**Regular expressions are the most basic tool for text processing and provide a formal language for specifying text strings. They can be used in virtually every programming language, and the syntax is very similar across them.**

[Regex How to Reference](https://docs.python.org/3/howto/regex.html#regex-howto)

**In Python there are 4 general steps to using regex:**

1. Import the regex module with import re.

2. Create a regex (Pattern) object with the re.compile() function.

3. Pass the string you want to examine to the desired matching method, which returns a Match object if successful.

4. Once the Match object is obtained, its built-in methods and attributes can be used to return the desired data.

In [73]:
# Basic example of all 4 steps (sections discussed in detail below)

# 1. Import re
import re

# 2. Creating a regex object using re.compile, this returns a Pattern object
regx1 = re.compile(r'jackass')
print(type(regx1))

# 3. Pass the string you want to search into the regex object's search() method,
#    which returns a Match object
test_str = "Don't do that You jackass!"
match1 = regx1.search(test_str)
print(type(match1))

# 4. Call the desired method of the Match object, here group()
print(match1.group())

<class 're.Pattern'>
<class 're.Match'>
jackass


**Regular expressions use the backslash character to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python’s usage of the same character for the same purpose in string literals.**

**The solution is to use Python’s raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'. So r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Usually patterns will be expressed in Python code using this raw string notation. The image below shows that using raw strings is much simpler than the alternative.**

<img src="img/backslash.png">
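
As a rough sketch of the raw string point above, the cell below compares the two notations. The sample path string is made up for illustration and is not from the original notes.

```python
import re

# A normal string literal turns \n into a single newline character,
# while a raw string keeps the backslash and the 'n' as two characters.
plain = "\n"
raw = r"\n"
print(len(plain), len(raw))

# Matching one literal backslash: the regex itself needs \\ which is
# easiest to write as a raw string. Without the r prefix, Python's own
# escaping means four backslashes are needed for the same pattern.
path = "C:\\temp"  # hypothetical sample: the 7-char string C:\temp
with_raw = re.findall(r"\\", path)
without_raw = re.findall("\\\\", path)
print(with_raw == without_raw, len(with_raw))
```

Both patterns find the same single backslash; the raw version is simply easier to read.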

### Regex Syntax
**Regex uses special characters called *metacharacters* to signal specific match criteria within a regular expression. These can be confusing and are explained in detail below.**

In [None]:
# Regex Metacharacters
.  ^  $  *  +  ?  { }  [ ]  \  |  ( )

### [  ] - Brackets
---
The brackets metacharacter is used to specify a set of characters
you want to match, called a character class. Chars can be listed individually or as a range separated with a dash.

For example:
* [abc] will match any of the characters 'a', 'b', or 'c'
* [a-c] will do the same thing

Most metacharacters are NOT active inside classes (the exceptions are \, a leading ^, and the - used for ranges). For example:
[adm?] will match any of the chars 'a', 'd', 'm', or '?'. The '?' is usually a
metacharacter, but it is stripped of this meaning inside a character class [ ].

In [8]:
# Brackets example
# This example locates all capital letters in a string:
import re
test_str = 'Do you have a ProBlem Jackass?'
# Use a name other than re so the re module itself is not overwritten
regex = re.compile(r'[A-Z]')
print(regex.findall(test_str))

['D', 'P', 'B', 'J']
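
The point about metacharacters losing their meaning inside a class can be checked with a small sketch like the one below. The test strings 'mad?' and 'cabbage' are invented for illustration.

```python
import re

# Inside [ ] the ? is an ordinary character, so this class matches
# any single 'a', 'd', 'm' or '?'.
in_class = re.findall(r'[adm?]', 'mad?')
print(in_class)

# A range and an explicit list describe the same set of characters.
range_hits = re.findall(r'[a-c]', 'cabbage')
list_hits = re.findall(r'[abc]', 'cabbage')
print(range_hits == list_hits)
```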


### ^ - Caret
---
Inside a character class, the caret matches any character NOT listed in the set.
Ex: [^5] will match all characters except 5.

Note that the ^ must be the first character inside the class; if it is located anywhere else
inside the character class, it becomes just a char itself, like the ? example above.

Outside a character class the caret has a different meaning: ^[A-Z] matches a capital letter at the start of the string (or at the start of each line in MULTILINE mode).

In [9]:
# Caret example

# The regex r'[^A-Z]' matches all chars that are NOT capital letters
# Note, the caret is INSIDE of the [ ], also note whitespace is included
import re
test_str = 'Do you have a ProBlem Jackass?'
regex = re.compile(r'[^A-Z]')
print(regex.findall(test_str))

print()

# To exclude whitespace as well, use this; it says find all chars not A-Z or whitespace:
regex = re.compile(r'[^A-Z\s]')
print(regex.findall(test_str))

print()

# To remove ? as well, this says find all not A-Z, whitespace, or ?
regex = re.compile(r'[^A-Z\s\?]')
print(regex.findall(test_str))

['o', ' ', 'y', 'o', 'u', ' ', 'h', 'a', 'v', 'e', ' ', 'a', ' ', 'r', 'o', 'l', 'e', 'm', ' ', 'a', 'c', 'k', 'a', 's', 's', '?']

['o', 'y', 'o', 'u', 'h', 'a', 'v', 'e', 'a', 'r', 'o', 'l', 'e', 'm', 'a', 'c', 'k', 'a', 's', 's', '?']

['o', 'y', 'o', 'u', 'h', 'a', 'v', 'e', 'a', 'r', 'o', 'l', 'e', 'm', 'a', 'c', 'k', 'a', 's', 's']
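
The three roles of the caret (negation, literal char, and start anchor) can be put side by side in a short sketch. The sample strings are made up for illustration.

```python
import re

text = 'Tea at 5'

# First character inside [ ]: negation. Matches every char except '5'.
not_five = re.findall(r'[^5]', text)
print(not_five)

# Anywhere else inside [ ]: a literal '^' character.
literal_caret = re.findall(r'[5^]', '5^2')
print(literal_caret)

# Outside [ ]: an anchor. Only a capital letter at the very start matches.
at_start = re.findall(r'^[A-Z]', text)
print(at_start)
```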


### \ - Backslash
---
The backslash is one of the more important metachars. As in Python strings, this char can be followed by various chars to signal special sequences. It also can be used to escape all the metachars when needed.

In [10]:
# Ex: if you need to match [ or \ you would use \[ and \\ respectively

# Some of the special sequences beginning with \ represent pre-defined sets of
# chars that are often useful.

# the sequences listed below can be included inside of a char class.
# ex: [\s,.]
# the above would match any whitespace char  or any ,  or any .

**Common backslash sequences**

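1. **\d** matches any decimal digit; equivalent to [0-9]
2. **\D** matches any non-digit char; equivalent to [^0-9]
3. **\s** matches any whitespace char; equivalent to [ \t\n\r\f\v]
4. **\S** matches any non-whitespace char; equivalent to [^ \t\n\r\f\v]
5. **\w** matches any alphanumeric char or underscore; equivalent to [a-zA-Z0-9_]
6. **\W** matches any non-alphanumeric char; equivalent to [^a-zA-Z0-9_]

For str patterns these equivalences are exact only with the ASCII flag; by default the sequences also match the corresponding Unicode characters.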

[Complete sequence reference](https://docs.python.org/3/library/re.html#re-syntax)
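
A quick sketch of a few of these sequences in action is shown below. The sample string is invented for illustration; the last pattern is the [\s,.] class mentioned in the comments above.

```python
import re

sample = 'Room 42, floor_3!'

digits = re.findall(r'\d', sample)      # single decimal digits
words = re.findall(r'\w+', sample)      # runs of letters, digits, underscore
non_words = re.findall(r'\W', sample)   # everything \w does not match
in_class = re.findall(r'[\s,.]', sample)  # whitespace, comma or period

print(digits)
print(words)
print(non_words)
print(in_class)
```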

### .  - Period
---
The period metacharacter matches any character except a newline; in the alternative mode re.DOTALL it matches a newline as well. This special char is often used when you want to match 'any character.'

In [11]:
# Period example

import re
test_str = 'Have we begun yet? What. When are we going to beg5n?'

# . matches any single char except a newline, so both begun and beg5n match
re1 = re.compile(r'beg.n') 
print(re1.findall(test_str))
print()

# It is EXTREMELY important to remember that matching a literal . requires
# escaping it as \. or else the . will match every possible char.
re2 = re.compile(r'\.')
print(re2.findall(test_str))
print()

# Here . is used to return all characters (including whitespace) from a string.
# Note this returns a list of single-character strings.
re3 = re.compile(r'.')
print(re3.findall(test_str))

['begun', 'beg5n']

['.']

['H', 'a', 'v', 'e', ' ', 'w', 'e', ' ', 'b', 'e', 'g', 'u', 'n', ' ', 'y', 'e', 't', '?', ' ', 'W', 'h', 'a', 't', '.', ' ', 'W', 'h', 'e', 'n', ' ', 'a', 'r', 'e', ' ', 'w', 'e', ' ', 'g', 'o', 'i', 'n', 'g', ' ', 't', 'o', ' ', 'b', 'e', 'g', '5', 'n', '?']
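
The re.DOTALL mode mentioned above is not exercised in the cell, so here is a minimal sketch of the difference. The two-line test string is made up for illustration.

```python
import re

two_lines = 'beg\nn'  # hypothetical sample with a newline where . must match

# By default . refuses to match the newline, so nothing is found.
default_hits = re.findall(r'beg.n', two_lines)
print(default_hits)

# With re.DOTALL the . also matches the newline.
dotall_hits = re.findall(r'beg.n', two_lines, re.DOTALL)
print(dotall_hits)
```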


### * - Star (Repeating Char 1)
---
A * means the preceding char can be matched ZERO OR MORE times. Ex: ca*t would match 'caaat', where the a is repeated three times, and it would also match 'ct', where the a is repeated zero times. Note that even though 'ct' has no a's, it still matches because only the c and t are required.

In [12]:
# Star example

import re
test_str = 'hooooooooly moly Batman?'
re1 = re.compile(r'ho*ly|moly')
print(re1.findall(test_str))

['hooooooooly', 'moly']
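
The ca*t example from the text can be checked directly. This sketch uses fullmatch() so that each candidate word must match in its entirety; the word list (including the non-matching 'cabt') is an assumption added for illustration.

```python
import re

star = re.compile(r'ca*t')

# Map each candidate word to whether the whole word matches ca*t.
star_results = {word: bool(star.fullmatch(word))
                for word in ['ct', 'cat', 'caaat', 'cabt']}
print(star_results)
```

'ct' matches with zero a's, while 'cabt' fails because the b is not allowed by the pattern.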


### + - Plus (Repeating Char 2)
---
The difference between + and * is that + REQUIRES AT LEAST ONE OCCURRENCE of the previous char

Examples: 
* ca+t would match 'caaat' as a is present 3 times
* ca+t WOULD NOT match 'ct' as there is no 'a' present

In [13]:
# Plus example

import re
test_str = 'hooooooooly moly Batman?'
re1 = re.compile(r'ho+ly')
print(re1.findall(test_str))

['hooooooooly']
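
To see the ca+t behaviour described above next to ca*t, a small sketch along these lines can help. The test string is made up for illustration.

```python
import re

words = 'ct cat caaat'

# + needs at least one 'a', so 'ct' is skipped.
plus_hits = re.findall(r'ca+t', words)
print(plus_hits)

# * accepts zero 'a' chars, so 'ct' is included.
star_hits = re.findall(r'ca*t', words)
print(star_hits)
```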


### ? - Question mark (Repeating Char 3)
---
This char matches either once or zero times, and can be thought of as marking the previous character as optional.

Example:
* home-?brew would match either 'homebrew' or 'home-brew'

In [14]:
# Question mark example

import re
test_str = 'Nice hair color?'
re1 = re.compile(r'colou?r')
print(re1.findall(test_str))

['color']
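
The home-?brew example from the text, plus both spellings for the colou?r pattern, can be tried as follows. The test strings (including the deliberately wrong 'home--brew') are assumptions for illustration.

```python
import re

# The - is optional, so zero or one dash is accepted, but not two.
brew_hits = re.findall(r'home-?brew', 'homebrew, home-brew, home--brew')
print(brew_hits)

# The u is optional, so both spellings are found.
colour_hits = re.findall(r'colou?r', 'color colour')
print(colour_hits)
```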


### { } - Curly Braces (Repeating Char 4)
---
Curly braces are a little more complicated.

Example: <br>
{m,n} means there must be at least m repetitions, and at most n, of the preceding char

In [15]:
# Curly braces example

import re 

test_str1 = 'a/b'
test_str2 = 'a///b'
test_str3 = 'a///////b'

# Here {1,3} means the / must at least appear once and no more than 3 times.
re1 = re.compile(r'a/{1,3}b')

# Thus the above regex would match a/b, a//b, a///b, but not a////b
print(re1.findall(test_str1))
print(re1.findall(test_str2))
print(re1.findall(test_str3))

['a/b']
['a///b']
[]
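
The comment in the cell above also mentions a//b and a////b, which are not actually run. A sketch that tests every case from zero to four slashes might look like this; the candidate list is an assumption for illustration.

```python
import re

slashes = re.compile(r'a/{1,3}b')
candidates = ['ab', 'a/b', 'a//b', 'a///b', 'a////b']

# Keep only the candidates that match the pattern in full.
brace_hits = [c for c in candidates if slashes.fullmatch(c)]
print(brace_hits)
```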


### | - Pipe
---
The pipe is used for disjunction, similar to or in an if statement.

* ex1: groundhog|woodchuck  (means groundhog or woodchuck)
* ex2: a|b|c (a or b or c,  same as [abc])
* ex3: [gG]roundhog|[Ww]oodchuck (takes either cap or lowercase first letter)

In [16]:
# Pipe example

import re
test_str = 'Do you have a ProBlem jackass?'
re1 = re.compile(r'you|jackass')
re2 = re.compile(r'jack|ass|Pro')
re3 = re.compile(r'[Jj]ackass|[Dd]o')

print(re1.findall(test_str))
print(re2.findall(test_str))
print(re3.findall(test_str))

['you', 'jackass']
['Pro', 'jack', 'ass']
['Do', 'jackass']
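
The groundhog/woodchuck patterns listed above can be tried with a sketch like the following. The sentence is made up for illustration.

```python
import re

sentence = 'The Groundhog is also called a woodchuck.'

# Lowercase-only alternatives miss the capitalized 'Groundhog'.
plain_hits = re.findall(r'groundhog|woodchuck', sentence)
print(plain_hits)

# Character classes on the first letter accept either case.
either_case_hits = re.findall(r'[gG]roundhog|[Ww]oodchuck', sentence)
print(either_case_hits)

# For single chars, a|b|c finds the same matches as [abc].
same_as_class = re.findall(r'a|b|c', 'cab') == re.findall(r'[abc]', 'cab')
print(same_as_class)
```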


### \$ - Dollar Sign
---
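The \$ symbol matches at the end of the string (or at the end of each line in MULTILINE mode); it anchors the pattern rather than matching a character itself.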

In [17]:
# Dollar sign example

import re
test_str = 'Do you have a ProBlem jackasS?'
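# Inside [ ] the $ is a literal char, so [S$] matches any 'S' or '$' anywhere
re1 = re.compile(r'[S$]')

# [A-Z]$ - matches a capital letter at the end of the string
re2 = re.compile(r'[A-Z]$')

# Likewise [?$] matches any '?' or '$', not just a '?' at the end of the string
re3 = re.compile(r'[?$]')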

print(re1.findall(test_str))
print(re2.findall(test_str)) #note end of line contains ? not S so no match
print(re3.findall(test_str))

['S']
[]
['?']
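
As a hedged sketch of how the end anchor behaves on multi-line text, the cell below compares the default mode with re.MULTILINE. The sample strings are invented for illustration.

```python
import re

lines = 'first linE\nsecond linE'

# By default $ anchors only at the end of the whole string.
end_of_string = re.findall(r'[A-Z]$', lines)
print(end_of_string)

# With re.MULTILINE it anchors at the end of every line.
end_of_lines = re.findall(r'[A-Z]$', lines, re.MULTILINE)
print(end_of_lines)

# Inside a character class the $ is just a literal dollar sign.
literal_dollar = re.findall(r'[$]', 'cost: $5')
print(literal_dollar)
```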


### Pattern objects and their matching methods
---
**When compiling a regex, optional flags can be passed to modify how the pattern is matched. These are listed below:**
* **ASCII, A** - Makes escapes such as \w, \b, \s and \d match only ASCII chars
* **DOTALL, S** - Make . match any char, including newlines
* **IGNORECASE, I** - Do case-insensitive matches
* **LOCALE, L** - Do a locale-aware match
* **MULTILINE, M** - Multi-line matching, affecting ^ and \$
* **VERBOSE, X** - Enable verbose REs, which can be organized more cleanly and understandably
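
A minimal sketch of passing flags to re.compile() follows. The sample text and the toy phone pattern are assumptions for illustration; flags can be combined with the | operator.

```python
import re

text = 'Jackass\njackass'

# IGNORECASE: both capitalizations are found.
ignore_case = re.compile(r'jackass', re.IGNORECASE).findall(text)
print(ignore_case)

# Without MULTILINE, ^ only anchors at the start of the whole string.
start_only = re.compile(r'^jackass', re.I).findall(text)
print(start_only)

# With MULTILINE, ^ anchors at the start of every line.
each_line = re.compile(r'^jackass', re.I | re.M).findall(text)
print(each_line)

# VERBOSE: whitespace and # comments inside the pattern are ignored.
verbose = re.compile(r'''
    \d{3}   # first three digits
    -       # separator
    \d{4}   # last four digits
''', re.VERBOSE)
verbose_hits = verbose.findall('call 555-1234')
print(verbose_hits)
```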

**Once a regex is compiled, a Pattern object is returned (step 2 of the regex process). Pattern objects have several methods and attributes; below are the most significant ones.**

* **match()** - Determines if the RE matches at the beginning of the string
* **search()** - Scans through a string, looking for any location with match
* **findall()** - Finds all substrings where RE matches, returns as a list
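* **finditer()** - Finds all substrings where the RE matches, returns them as an iterator of Match objects
* **sub()** - Replaces each substring where the RE matches with a replacement string, returns the new string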

**Note that match() and search() both return None if no match can be found.**

**For a complete method/attr reference see the re docs:**
[regex docs](https://docs.python.org/3/library/re.html#module-re)
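
The five methods listed above can be compared on one string. This is a rough sketch with a made-up sample; the pattern \d+ simply finds runs of digits.

```python
import re

number = re.compile(r'\d+')
text = 'pipers 11, lords 10'

# match() only tries the beginning of the string, which is not a digit here.
at_start = number.match(text)
print(at_start)

# search() scans forward and returns the first match it finds.
first = number.search(text)
print(first.group())

# findall() returns every matching substring as a list of strings.
everything = number.findall(text)
print(everything)

# finditer() yields Match objects, which carry positions as well.
spans = [m.span() for m in number.finditer(text)]
print(spans)

# sub() returns a new string with every match replaced.
replaced = number.sub('#', text)
print(replaced)
```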

In [70]:
# Basic usage of match()
import re
xmasRegex = re.compile(r'\d+\s\w+')

# Note that xmasRegex is a Pattern object
print(type(xmasRegex))

# The match() method only looks for a match at the beginning of the string
test = xmasRegex.match('12 drummers, 11pipers, 10 lords, 9ladies')

# Note that test is a Match object, which means that for further analysis
# a Match object method must be used, (see next section)
print(type(test))
print(test)

<class 're.Pattern'>
<class 're.Match'>
<re.Match object; span=(0, 11), match='12 drummers'>


In [18]:
# Basic usage of findall()
import re

# this regex finds one or more digits, followed by a single whitespace char,
# followed by one or more word chars (letters, digits or underscore)
xmasRegex = re.compile(r'\d+\s\w+')

# findall() returns all matches within the string. 
test = xmasRegex.findall('12 drummers, 11pipers, 10 lords, 9ladies')

# note that test here is a list as findall() returns a list of all matches
print(type(test))
print(test)

<class 'list'>
['12 drummers', '10 lords']


### Match objects and their methods

**A successful match() or search() call returns a Match object, which contains info about the match: where it starts and ends, the substring it matched, and more. This object can be queried for more info about the matching string using its built-in methods/attr:**

* **group()** - Returns the string matched by the RE
* **start()** - Returns the starting position of the match
* **end()**   - Returns the ending position of the match
* **span()**  - Returns a tuple containing the (start, end) positions of the match
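
A short sketch of these four methods on a single match is shown below. The sample string is made up for illustration; the None check reflects the fact that search() returns None when nothing matches.

```python
import re

text = 'Order 66 now'
m = re.search(r'\d+', text)

if m is not None:
    print(m.group())   # the matched substring
    print(m.start())   # index where the match begins
    print(m.end())     # index just past the last matched char
    print(m.span())    # (start, end) as a tuple

    # Slicing the original string with start/end gives the match back.
    sliced = text[m.start():m.end()]
    print(sliced == m.group())
```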

### Grouping
**Frequently you need to obtain more information than just whether the RE matched or not. Regular expressions are often used to dissect strings by writing a RE divided into several subgroups which match different components of interest, for example the parts of a phone number or an email address.**

**Groups indicated with '(', ')' also capture the starting and ending index of the text that they match; this can be retrieved by passing an argument to group(), start(), end(), and span(). Groups are numbered starting with 0. Group 0 is always present; it’s the whole RE, so match object methods all have group 0 as their default argument.**

In [19]:
# () are used to create groups.

import re 

#note (\d\d\d) is a group with three digits
phoneNum2 = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phoneNum2.search('My number is 678-327-8581')

# group() can return subgroups, or multiple groups at a time

print(mo.group(1))
print(mo.group(2))
print(mo.group())

# groups() returns tuple containing all groups.
print(mo.groups())

678
327-8581
678-327-8581
('678', '327-8581')
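
The text above notes that group(), start(), end() and span() all accept a group number. A minimal sketch using the same phone pattern and sample string as the cell above:

```python
import re

phone = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phone.search('My number is 678-327-8581')

# Passing several group numbers to group() returns a tuple.
both = mo.group(1, 2)
print(both)

# start(), end() and span() take a group number as well.
area_start, area_end = mo.start(1), mo.end(1)
print(area_start, area_end)
rest_span = mo.span(2)
print(rest_span)

# With no argument, the methods default to group 0, the whole match.
whole_span = mo.span()
print(whole_span)
```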


### More examples of common regex queries


In [20]:
# Finding multiple occurrences of the same word with different capitalization
import re
test_str = 'Where in the hell is the other remote? The blithe one is here!'

# This locates all upper & lowercase 'the', however, it also captures 'the'
# in the word other, which is incorrect if you just want the word 'the' 
re1 = re.compile(r'[Tt]he')
print(re1.findall(test_str))

# To solve this issue we want 'the' without any other chars around it
re1 = re.compile(r'[^A-Za-z][Tt]he[^A-Za-z]')
print(re1.findall(test_str))

['the', 'the', 'the', 'The', 'the']
[' the ', ' the ', ' The ']


In [None]:
# Note that in the example above two types of errors are encountered

# Type I  - False Positives (matching strings that should not match)
# Type II - False Negatives (not matching things that should be matched)
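
Both error types can be reproduced with the two patterns from the example above on a different sentence (made up for illustration). The last pattern uses the \b word-boundary sequence, which is not covered in these notes but is part of the re module; it is shown only as one possible way to reduce both kinds of error.

```python
import re

text = 'The other theory: the cat.'

# False positives: 'the' inside 'other' and 'theory' is matched too.
loose = re.findall(r'[Tt]he', text)
print(loose)

# False negative: the leading 'The' has no char before it, so the
# pattern that demands a non-letter on both sides misses it.
strict = re.findall(r'[^A-Za-z][Tt]he[^A-Za-z]', text)
print(strict)

# \b matches at a word boundary without consuming a character.
bounded = re.findall(r'\b[Tt]he\b', text)
print(bounded)
```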

In [105]:
# Complex phone number regex

import re

# A US phone # can have up to 4 numerical sections, plus optional separators
# 1. country code (+1) OPTIONAL
# 2. area code (3 digits) OPTIONAL
# 3. first 3 digits REQUIRED
# 4. final 4 digits REQUIRED
# 5. all separator symbols are optional (. , - *)

# Here each numerical section is grouped with ()
# I didn't include the country code, but the area code is optional like ()?
# Note that each number group has the same setup ([]{})
# for example ([0-9]{2,3}) means no less than 2 and no more than 3 digits
# note that - and . are both escaped with \ and made optional with ? like \.?
re1 = re.compile(r'([0-9]{2,3})?\-?\.?([0-9]{2,3})\-?\.?([0-9]{4})')

test_num1 = '7705955539'
test_num2 = '770-595-5539'
test_num3 = '770.595.5539'
test_num4 = '770 595 5539'

print(f'Num1: {re1.findall(test_num1)}')
print(f'Num2: {re1.findall(test_num2)}')
print(f'Num3: {re1.findall(test_num3)}')
print(f'Num4: {re1.findall(test_num4)}') # returns an empty [] because spaces are not accepted

# Below is a modified regex that also accepts whitespace, to fix the num4 issue.
# \s was added and all separators were collected into one optional char class,
# [-.\s]? (inside a class, a leading - and the . need no escaping, and a +
# would be a literal plus sign, so it is left out). This is a little easier
# to understand than the first example, and it works for all the numbers
re2 = re.compile(r'([0-9]{2,3})?[-.\s]?([0-9]{2,3})[-.\s]?([0-9]{4})')
print(f'Num4 Fixed: {re2.findall(test_num4)}')

# In a production environment some type of check would need to be implemented
# to produce an error message if an incorrect format is entered. Because
# findall() returns an empty list if there are no matches, this is an easy method
test_num5 = '770,595,5539'
if len(re2.findall(test_num5)) == 0:
    print(f'The number you entered: {test_num5} is not formatted properly')

# Below, ( ) surround the area code, which neither of the above regexes
# accounts for, therefore an empty string is returned for the area code
test_num6 = '(770)595-5539'
print(f'Num6: {re2.findall(test_num6)}')

# Here is a revised regex to account for the parentheses around the area code
# that still works with all other numbers, note both parentheses were made
# optional with \(? and \)?
re2 = re.compile(r'(\(?[0-9]{2,3}\)?)?[-.\s]?([0-9]{2,3})[-.\s]?([0-9]{4})')
print(f'Num6 Fixed: {re2.findall(test_num6)}')

Num1: [('770', '595', '5539')]
Num2: [('770', '595', '5539')]
Num3: [('770', '595', '5539')]
Num4: []
Num4 Fixed: [('770', '595', '5539')]
The number you entered: 770,595,5539 is not formatted properly
Num6: [('', '595', '5539')]
Num6 Fixed: [('(770)', '595', '5539')]


**The biggest takeaway from the above example is that regular expressions get complicated quickly, and that finding ways to simplify regex queries is worthwhile but can be time consuming.**