## Regular Expressions 
* tool for searching and matching parts of a text by describing the patterns to identify those parts
* a set of symbols representing a text pattern
* formal language interpreted by a regular expression processor
* used for matching, searching, and replacing text
* used by programming languages
* "Regex" for short

* Matches
  + a regular expression matches text if it correctly describes the text
  + text matches a regular expression if it is correctly described by the expression
  
* online javascript app to practice Regex:
  + https://regexr.com
  + https://regex101.com
  + https://regexpal.com

### Regular expression conventions:
* regular expressions are conventionally written between two forward slashes `/` as delimiters
  + /abc/ refers to the pattern abc; the slashes are not part of the expression
  + most of the time you write the pattern without the forward slashes, depending on the programming language
  + text strings to be matched are written in quotes, such as "abc"; the quotes are not part of the text
  + flags select the different modes a regex can run in
    - g for global
    - i for case insensitive (not recommended most of the time)
    - m for multiline
    - s for single line (dotall)
    - u for unicode
    - y for sticky
* modes of the regex
  + standard: /re/ match re exactly once for the first match (find)
  + global:   /re/g match re over and over again through the document (find_all)
  + case insensitive: /re/i 
  + multiline: /re/m makes ^ and $ match at the start and end of every line, not only at the start and end of the whole text
  + grep in unix means g/re/p: global regular expression print
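
The slash-and-flag notation above is not used in every language. As a rough sketch of how the same modes look in Python's `re` module (the sample text is made up for illustration), `re.search` plays the role of a standard match, `re.findall` the role of a global match, and the flags are passed as constants:

```python
import re

text = "Car, car\ncar"  # made-up sample text, two lines

# standard mode: only the first (leftmost) match is reported
first_span = re.search(r"car", text).span()            # (5, 8)

# global mode: every match in the text
all_matches = re.findall(r"car", text)                 # ['car', 'car']

# case insensitive mode
ci_matches = re.findall(r"car", text, re.IGNORECASE)   # ['Car', 'car', 'car']

# multiline mode: ^ also matches right after each line break
line_starts = re.findall(r"^car", text, re.IGNORECASE | re.MULTILINE)  # ['Car', 'car']
```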

### Characters
* literal 
* metacharacters
* wildcard metacharacters
* escaping metacharacters
* other special characters

#### literal characters
* match themselves exactly, character by character
  + /car/ matches "car"
  + /car/ matches the first 3 letters of "carnival"
  + case-sensitive by default

* standard(non-global) matching
  + the earliest (leftmost) match is always preferred; the engine works from left to right
  + when the 1st match is found, it stops
  
* global matching
  + all matches are found throughout the text
  
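* matching
  + the engine backtracks: when /cat/ is tried against "camel" and t fails to match m, the engine gives up on that starting position and retries the pattern from the next character (comparing c against the a in camel)
  + regex engines are eager: they return the first match they find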

#### Metacharacters
* characters with special meaning
* transform literal characters into powerful expressions
* only a few to learn
  + `\.*+-{}[]^$|?():!=`
  + can have more than one meaning
* most of the skills are about how to use these metacharacters in regex
* regexr.com tool colors different metacharacter symbols with colors to help understand complex regex

#### wildcard metacharacter
* matches any single character except a line break (.)
* /h.t/ matches hat, hot, and hit, but not heat
* broadest match possible
* most common metacharacter
* most common mistake: using an unescaped . where a literal period is meant
  + /9.00/ matches "9.00", "9500" and "9-00"
* a good regex should match the text you want to target and only that text, nothing more 
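
A small Python sketch of the wildcard examples above. `re.fullmatch` is used here (an assumption of this sketch) so that each candidate string must match the pattern from start to end:

```python
import re

words = ["hat", "hot", "hit", "heat"]
# . stands for exactly one character, so "heat" (two characters between h and t) fails
wildcard_hits = [w for w in words if re.fullmatch(r"h.t", w)]    # ['hat', 'hot', 'hit']

prices = ["9.00", "9500", "9-00"]
# the unescaped . is too broad: it accepts any character in that position
loose_hits = [p for p in prices if re.fullmatch(r"9.00", p)]     # all three strings
```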

#### escaping metacharacters
* the backslash (\) escapes the next character
* escaping means treating a metacharacter as a literal character
* match a literal period with `/\./`
  + `/9\.00/` matches "9.00", but not "9500" or "9-00"
* escaping is only for metacharacters
  + literal characters should never be escaped; escaping may give them a special meaning (e.g. `\d`, `\t`)
  + quotation marks are not metacharacters; they do not need to be escaped
* escaping metacharacters, especially the wildcard, is important!
  + to match both file names in "his_export.txt her_export.txt", use `/h.._export\.txt/g`
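
The effect of escaping the period can be checked with a short Python sketch. `re.escape` is a standard-library helper that escapes metacharacters in a string for you; the sample strings come from the notes above:

```python
import re

prices = ["9.00", "9500", "9-00"]
# \. matches only a literal period
strict_hits = [p for p in prices if re.fullmatch(r"9\.00", p)]   # ['9.00']

# wildcard for the unknown letters, escaped period for the file extension
files = re.findall(r"h.._export\.txt", "his_export.txt her_export.txt")

# re.escape builds the escaped pattern from a plain string
escaped = re.escape("9.00")
```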

#### Other special characters
* spaces
* tabs (\t): an escaped literal t stands for a tab
* line returns (`\r, \n, \r\n`)
  + `\r` carriage return
  + `\n` newline (line feed)
  + which line return is used depends on the operating system (`\n` on Unix/Linux/macOS, `\r\n` on Windows)

### exercise: search in a document to check
* how many times does the word "self" appear (both upper and lower case)?
* count himself, herself, itself, myself, yourself, thyself
* use three literal characters and three wildcard characters, match: please, palace, parade

### Character Sets
* `[` begins a character set and `]` ends it
* a character set 
  + can match any one of several characters
  + it matches only one character
  + order of characters in the set doesn't matter
  + `/[aeiou]/` matches any one vowel
  + `/gr[ea]y/` matches "grey" and "gray"
  + `/gr[ea]t/` doesn't match "great" because character sets only match one character in text
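
A minimal Python check of the character set examples (the sample strings are made up):

```python
import re

# [ea] stands for exactly one character, either e or a
spellings = re.findall(r"gr[ea]y", "grey gray griy")   # ['grey', 'gray']

# "great" has two characters between "gr" and "t", so one set is not enough
great = re.search(r"gr[ea]t", "great")                 # None

vowels = re.findall(r"[aeiou]", "regex")               # ['e', 'e']
```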
  
### Character Ranges
* use - to indicate a range of characters
* a range includes all characters between the two endpoints
  + ranges rely on the characters having a defined order (their character codes), which determines what lies between the endpoints
  + numbers
  + letters
  + - is only a metacharacter inside a character set; outside of a character set it is a literal dash
  + `[0-9]`, `[A-Za-z]`, `[a-dw-y]`
  + `[50-99]` will not find numbers from 50 to 99; a range only ever represents one character
    - the engine reads `[50-99]` as 5, 0-9, and 9, which is just `[0-9]`
    - it is a character range, not a number range
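
The `[50-99]` pitfall can be verified in Python. The last pattern is one possible way to match the two-digit numbers 50 to 99; it is a sketch that uses the word boundary `\b`, which these notes cover in a later section:

```python
import re

digits = "0123456789"
# [50-99] is the set {5, 0-9, 9}, i.e. the same single characters as [0-9]
odd_range = re.findall(r"[50-99]", digits)
plain_range = re.findall(r"[0-9]", digits)

# one character set per digit position: tens digit 5-9, units digit 0-9
fifty_to_99 = re.findall(r"\b[5-9][0-9]\b", "7 49 50 73 99 100")   # ['50', '73', '99']
```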
    
### Negative Character Sets
* ^ negates a character set
  + the character to match is NOT any one of several characters
  + add ^ as the first character inside a character set
  + `/[^aeiou]/` matches any one character that is not a vowel (consonants, but also digits, spaces, and punctuation)
  + `/see[^mn]/` matches "seek" and "sees", but not "seem" or "seen"
  + `/see[^mn]/` does not match "see", since a negative character set still requires a character
  + `/see[^mn]/` does match "see " and "see." because " " and "." are still characters
  + `/[^a-zA-Z]/` matches any character that is not a lower or upper case letter
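
A short Python sketch of the `see[^mn]` examples. The sample sentence is made up so that it ends in a bare "see" with nothing after it:

```python
import re

text = "seek sees seem seen see. see"
# [^mn] must consume one character that is neither m nor n,
# so the final "see" (followed by nothing) does not match
hits = re.findall(r"see[^mn]", text)   # ['seek', 'sees', 'see.']
```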
  
### Metacharacters inside Character sets
* most metacharacters inside character sets lose their special meaning and are treated as literal characters
* `/h[a.]t/` matches "hat" and "h.t", but not "hot" since . is a literal character
  + exceptions that we need to escape: `] - ^ \` because
    - ] ends the character set; to match a literal ], escape it
    - - defines a character range; escape it to use it as a literal character
    - ^ negates the character set; escape it to use it as a literal character
    - \ is the escape character itself (used for ], -, and ^ above), so it must be escaped to be used as a literal character
* examples:
  + `/var[[(][0-9][\])]/` escape ] by `\]`
  + `/file[0\-\\_]1/` escape - and \ by `\-` and `\\`
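
The same examples in Python, as a sketch with made-up sample strings. One difference from the notation above: the opening bracket inside the set is also escaped (`\[`), because recent Python versions may emit a FutureWarning about a possible nested set for an unescaped `[[`:

```python
import re

# . inside a set is a literal period
dot_hits = [w for w in ["hat", "h.t", "hot"] if re.fullmatch(r"h[a.]t", w)]   # ['hat', 'h.t']

# \] is a literal closing bracket inside the set
brackets = re.findall(r"var[\[(][0-9][\])]", "var[3] var(4) var{5}")          # ['var[3]', 'var(4)']

# \- is a literal dash and \\ a literal backslash inside the set
names = re.findall(r"file[0\-\\_]1", r"file01 file-1 file\1 file_1 fileX1")
```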
  
### Shorthand Character Sets
* \d means any one digit, is the same as `[0-9]`
* \w means word character, is the same as `[a-zA-Z0-9_]`
* \s means whitespace, is the same as `[ \t\r\n]`
* \D means non-digit, is the same as `[^0-9]`
* \W means non-word character, is the same as `[^a-zA-Z0-9_]`
* \S means non-whitespace, is the same as `[^ \t\r\n]`
* Caution: \w includes underscore but not hyphen
* examples:
  + `/\d\d\d\d/` matches "1984", but not "text"
  + `/\w\w\w/` matches "ABC", "123", and "1_A"
  + `/\w\s\w\w/` matches "I am", but not "Am I" by the position of space
  + `/[\w\-]/` matches any word character or hyphen (useful). It combines word character with escaped literal character -
  + `/[^\d]/` is the same as both `/\D/` and `/[^0-9]/`
* caution: `/[^\d\s]/` is not the same as `/[\D\S]/`
  + `/[^\d\s]/` matches a character that is neither a digit nor whitespace
  + `/[\D\S]/` matches a character that is either not a digit or not whitespace, which is every character
    - 1 is a digit, but not a space, so it will be matched
    - a is not a digit, and not a space, so it will be matched
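
The difference between the two sets is easy to see on a tiny made-up sample in Python:

```python
import re

sample = "a1 _-"   # letter, digit, space, underscore, hyphen

# neither a digit nor whitespace
neither = re.findall(r"[^\d\s]", sample)   # ['a', '_', '-']

# "not a digit OR not whitespace" is true for every character
either = re.findall(r"[\D\S]", sample)

# \w includes the underscore but not the hyphen
word_chars = re.findall(r"\w", sample)     # ['a', '1', '_']
```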

### Exercise
* apply global regular expression to the text "Self-Reliance"
* Match both "lives" and "lived"
* Match "virtue" but not "virtues"
* Match the numbers and periods on all numbered paragraphs
* Find the 16-character word that starts with "c"

In [None]:
`/live[sd]/`
`/virtue[^s]/`
`/\d\./`
`/c\w{15}/`

### Repetition Metacharacters (regular repetition)
* `*` preceding item, zero or more times
* `+` preceding item, one or more times
* `?` preceding item, zero or one time

#### examples of repetition metacharacters
* `/.+/` matches any string of characters except a line return
* `/Good .+\./` matches "Good morning.", "Good day.", "Good evening.", and "Good night."
  + pattern is Good + one space + any character at least once + a literal period
* `/\d+/` matches "90210"
* `/\s[a-z]+ed\s/` matches lower case letters after a whitespace ending in "ed "
* `/apples*/` matches "apple", "apples", "applesssss"
* `/apples+/` matches "apples", "applessssss", but not "apple"
* `/apples?/` matches "apple", "apples", but not "applessssss"
* `/\d\d\d\d*/` is equivalent to `/\d\d\d+/`
* `/colou?r/` matches "color" and "colour"
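
A Python sketch of the `*`, `+` and `?` examples. `re.fullmatch` is used so that the whole candidate string has to match, which makes the "but not" cases visible:

```python
import re

candidates = ["apple", "apples", "applesssss"]

star = [w for w in candidates if re.fullmatch(r"apples*", w)]   # s zero or more times: all three
plus = [w for w in candidates if re.fullmatch(r"apples+", w)]   # s one or more times
opt = [w for w in candidates if re.fullmatch(r"apples?", w)]    # s zero or one time

spellings = re.findall(r"colou?r", "color colour colouur")      # ['color', 'colour']
```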

### Quantified Repetition
* define exactly how many times to repeat a pattern (4 times, or 4, 5, or 6 times, but no other number of repeats)
* `{min,max}`
  + min and max are non-negative integers
  + min must always be included and can be zero
  + max is optional
* three forms:
  + \d{4,8} matches numbers with four to eight digits
  + \d{4} matches numbers with exactly four digits (min is max)
  + \d{4,} matches numbers with four or more digits (max is infinite)
* overlap between quantified repetition and regular repetition
  + \d{0,} is equivalent to \d*
  + \d{1,} is equivalent to \d+
* Examples:
  + `/\d{3}-\d{3}-\d{4}/` matches most US phone numbers
  + `/A{1,2} bonds/` matches "A bonds" and "AA bonds", but not "AAA bonds"
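
The quantified forms can be tried out in Python. This is a sketch with made-up inputs; `re.fullmatch` is used for the bonds example because a plain search would still find "AA bonds" inside "AAA bonds":

```python
import re

phone_ok = re.fullmatch(r"\d{3}-\d{3}-\d{4}", "555-666-7890") is not None

bonds = [s for s in ["A bonds", "AA bonds", "AAA bonds"]
         if re.fullmatch(r"A{1,2} bonds", s)]                  # ['A bonds', 'AA bonds']

# four or more digits in a row
long_numbers = re.findall(r"\d{4,}", "123 1234 123456")        # ['1234', '123456']
```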
  

### Greedy Expression
* especially important when using repetition expressions
  + the matched strings are of indeterminate length
  + a pattern may be able to match several different substrings; greediness is how the engine chooses among them by default
* examples:
  + for `/\d+\w+\d+/` with text "01_FY_07_report_99.xls"
    - 01_FY_07
    - 01_FY_07_report_99
  + for `/".+", ".+"/` with text "Milton", "Waddams", "Initech, Inc."
    - ("Milton"), ("Waddams")
    - ("Milton"), ("Waddams", "Initech, Inc.")
    - (Milton", "Waddams"), ("Initech, Inc.")
* standard repetition quantifiers are greedy
* the expression tries to match the longest possible string
  + a `.+` will try to eat up as much of the string as possible
  + it still defers to achieving an overall match
  + example: `/.+\.jpg/` matches "filename.jpg"; the `.+` gives back `.jpg` so that the rest of the pattern can match
* being greedy means giving back as little as possible to the next part of the match
  + `/.*[0-9]+/` matches "Page 266"
    - `.*` matches "Page 26"
    - `[0-9]+` matches only "6"
    - greedy means match as much as possible before giving control to the next part of the expression
      + the engine first lets `.*` consume all of "Page 266"
      + `[0-9]+` then has nothing left to match, so the engine backtracks and `.*` gives back one character, the final "6"
  + `/\d+\w+\d+/` to match "01_FY_07_report_99.xls"
    - `\d+` matches "01"
    - `\w+` then consumes `_FY_07_report_99`
    - the final `\d+` finds nothing left to match, so the engine backtracks and `\w+` gives back the last "9"
    - finally, the matches are the following:
      + `\d+` matches `01`
      + `\w+` matches `_FY_07_report_9`
      + `\d+` matches `9`
      + returns "01_FY_07_report_99"
  + `/".+", ".+"/` with text "Milton", "Waddams", "Initech, Inc."
    - returns ("Milton", "Waddams"), ("Initech, Inc.")
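
To see which part of the text each greedy piece ends up with, the patterns can be wrapped in capture groups and inspected in Python. The groups are added only for this sketch; they do not change what is matched:

```python
import re

# .* takes as much as it can and gives back a single digit
page = re.search(r"(.*)([0-9]+)", "Page 266").groups()     # ('Page 26', '6')

# \w+ gives back only the final 9 so the last \d+ can match
report = re.search(r"(\d+)(\w+)(\d+)", "01_FY_07_report_99.xls")
report_parts = report.groups()       # ('01', '_FY_07_report_9', '9')
report_whole = report.group(0)       # '01_FY_07_report_99'

# the first .+ runs across the first two quoted names
names = re.search(r'"(.+)", "(.+)"', '"Milton", "Waddams", "Initech, Inc."').groups()
```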

### Lazy Expression
* put ? after a quantifier to make it lazy
* here ? does not mean "repeat 0 or 1 times"; it modifies the repetition metacharacter that precedes it
  + `*?`
  + `+?`
  + `{min,max}?`
  + `??`
* it instructs the quantifier to use a "lazy strategy" for making choices, which means
  + match as little as possible before giving control to the next part of the expression
  + still defers to achieving an overall match
  + not necessarily faster or slower, just a different strategy that returns different results
* examples
  + `/.*?[0-9]+/` for "Page 266": the two parts match "Page " and "266" (returns "Page 266")
  + `/.*?[0-9]+?/` for "Page 266": the two parts match "Page " and "2" (returns "Page 2")
  + `/\d+\w+?\d+/` to match "01_FY_07_report_99.xls" gives "01_FY_07"
  + `/".+?", ".+?"/` to match "Milton", "Waddams", "Initech, Inc." returns "Milton", "Waddams"
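
The lazy variants can be checked the same way in Python. The capture groups in the first two patterns are added only for inspection in this sketch:

```python
import re

# lazy .*? stops as soon as the digits can take over
lazy_first = re.search(r"(.*?)([0-9]+)", "Page 266").groups()    # ('Page ', '266')

# both parts lazy: the digits also take as little as possible
lazy_both = re.search(r"(.*?)([0-9]+?)", "Page 266").groups()    # ('Page ', '2')

# lazy \w+? stops at the first place where \d+ can match
report = re.search(r"\d+\w+?\d+", "01_FY_07_report_99.xls").group()   # '01_FY_07'

# each lazy .+? stops at the nearest closing quote
names = re.search(r'".+?", ".+?"', '"Milton", "Waddams", "Initech, Inc."').group()
```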

### Challenge: Repetition
* apply global regular expressions to the text "Self-Reliance"
* match: self, himself, herself, itself, myself, yourself, thyself
* match both "virtue" and "virtues"
* use quantified repetition to find the word that starts with "T" and has 12 letters
* match all text inside quotation marks, but nothing that is not inside them

#### Answers:
* `/\w{0,4}self/`
* `/virtues?/`
* `/T[a-zA-Z]{11}/`
* `/"(.|\n)+?"/` 
  + the lazy `+?` (at least once) stops at the next quote rather than running on to the last quote
  + to include line feeds we cannot put . in a character set, since in `[.]` the `.` is a literal character
  + we use grouping and alternation instead: `(.|\n)` is either the wildcard (anything except a line return) or a line return

### Grouping Metacharacters
* Anything inside () is a group expression
* group portions of the expression so we can use them in different ways:
  + Apply repetition operators to a group
  + create a group of alternation expressions
  + capture groups for use in matching and replacing
* Examples of applying repetition operators to a group:
  + define a group of (abc), and apply repetition operators
    - `/(abc)+/` matches "abc" and "abcabcabc". the entire group is repeated (unlike a character set, which matches a single character)
    - `/(in)?dependent/` matches "independent" and "dependent"
    - `/run(s)?/` the same as `/runs?/` but just more readable
* Examples of capturing groups for use in matching and replacing
  + group parts of the regex, then refer to the captured groups later
      - `/(\d{3})-(\d{3})-(\d{4})/` matches a phone number "555-666-7890".
      we can now access "555", "666" and "7890" by $1, $2 and $3 (or \1, \2 and \3, depending on the language)
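
In Python's `re` module the captured groups are read from the match object, and a replacement string refers to them as `\1`, `\2`, `\3` rather than `$1`, `$2`, `$3`. A minimal sketch (the surrounding sentence and the output format are made up):

```python
import re

pattern = r"(\d{3})-(\d{3})-(\d{4})"

parts = re.search(pattern, "call 555-666-7890 now").groups()     # ('555', '666', '7890')

# reuse the captured groups in a replacement
reformatted = re.sub(pattern, r"(\1) \2-\3", "555-666-7890")     # '(555) 666-7890'

# a quantifier after a group repeats the whole group
repeated = re.fullmatch(r"(abc)+", "abcabcabc") is not None
```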

### Alternation Metacharacters
* | is an OR operator
* either match expression on the left or on the right
* ordered, leftmost expression gets precedence
* multiple choices can be daisy-chained
* group alternation expressions to keep them distinct

* Examples:
  + `/apple|orange/` matches "apple" and "orange"
  + `/abc|def|ghi|jkl/` matches "abc", "def", "ghi", and "jkl"
  + `/apple(juice|sauce)/` is not the same as `/applejuice|sauce/`
    - "applejuice" or "applesauce" for `/apple(juice|sauce)/`
    - "applejuice" or "sauce" for `/applejuice|sauce/`
  + `/w(ei|ie)rd/` matches "weird" and "wierd"
  + `/(AA|BB|CC){4}/` matches "AABBAACC" and "CCCCBBBB". It doesn't matter which of AA, BB, CC appear, provided there are 4 of them in a row
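
A Python sketch of why the grouping matters. The non-capturing form `(?:...)` is used here so that `re.findall` returns the whole match instead of only the group content; the sample text is made up:

```python
import re

text = "applejuice applesauce sauce"

# the alternation is limited to the part inside the group
grouped = re.findall(r"apple(?:juice|sauce)", text)     # ['applejuice', 'applesauce']

# without the group, the alternatives are "applejuice" and "sauce"
ungrouped = re.findall(r"applejuice|sauce", text)       # ['applejuice', 'sauce', 'sauce']

# any four of the pairs in a row
four_pairs = [s for s in ["AABBAACC", "CCCCBBBB", "AABBAAC"]
              if re.fullmatch(r"(?:AA|BB|CC){4}", s)]
```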

### Efficiency when using alternation
* engine is eager and greedy
* result of being eager:
  + `/(peanut|peanutbutter)/g` with "peanutbutter" returns peanut due to eager match
  + `/(xyz|abc|def|ghi|jkl)/` applied to "abcdefghijklmnopqrstuvwxyz" will match abc
    - the engine does not start from the regex; it starts from the text and reports the first position where any alternative matches
    - at each position in the text, the engine tries the alternatives in order
      + put the simplest (most efficient) expression first in the regex, since the engine tries alternatives from left to right
      + `/\w+_\d{2,4}|\d{4}_export|export_\d{2}/` is not good, but `/export_\d{2}|\d{4}_export|\w+_\d{2,4}/` is
        - because export_\d{2} is the most specific, and \w+_\d{2,4} is the most complex (greatest variety of characters and repetitions)
* literal text and small character sets are more efficient and should be put first in alternation groups
* result of being greedy (when repetition is used)
  + `/peanut(butter)?/g` with "peanutbutter" returns peanutbutter due to greedy match (match as many as possible)
  + `/peanut(butter)??/g` with "peanutbutter" returns peanut  
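
Both effects can be reproduced in Python with the "peanutbutter" example:

```python
import re

text = "peanutbutter"

# eager: the first alternative that succeeds wins, even if a later one is longer
eager_short = re.findall(r"peanut|peanutbutter", text)    # ['peanut']
eager_long = re.findall(r"peanutbutter|peanut", text)     # ['peanutbutter']

# greedy ? takes the optional group when it can; lazy ?? skips it when it can
greedy = re.search(r"peanut(butter)?", text).group()      # 'peanutbutter'
lazy = re.search(r"peanut(butter)??", text).group()       # 'peanut'
```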

#### Exercise for efficiency using group alternation
* apply global regular expression to the text "Self-Reliance"
* Match "myself", "yourself", "thyself", but not "himself", "herself", "itself"
* Match "good", "goodness" and "goods" without typing "good" more than once
* match "do" or "does" followed by "no", "not" or "nothing" even when it occurs at the start of a sentence

`/(my|your|thy)self/`
`/good(ness|s)?/`
`/[Dd]o(es)? (nothing|not|no)/` or `/[Dd]o(es)? no(t(hing)?)?/` or `/[Dd]o(es)? no(thing|t)?/`

### Start and End Anchors
* start and end anchors reference a position, not an actual character
* they are zero-width
* `^` start of string/line
* `$` end of string/line
* `\A` start of string, never start of line
* `\Z` end of string, never end of line

#### Examples
* `/^apple/` or `/\Aapple/`
* `/apple$/` or `/apple\Z/` 
* `/^apple$/` or `/\Aapple\Z/` 


#### modes of regex when using anchors
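* single-line mode
  - default (not to be confused with the s flag, which only changes what . matches)
  - none of the four anchors match at line breaks inside the text
* multi-line mode
  - \A and \Z still do not match at line breaks, so they can be used to anchor to the entire file/string
  - ^ and $ match at the start and end of each line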

In [1]:
import re
re.search("^regex$", "string", re.MULTILINE)
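
A slightly fuller Python sketch of the two modes, using a made-up two-line text. `re.MULTILINE` changes `^` and `$` only; `\A` and `\Z` keep referring to the whole string:

```python
import re

text = "apple pie\ngreen apple"   # two lines

# default mode: ^ and $ only match at the start and end of the whole string
start_default = re.findall(r"^\w+", text)                # ['apple']
end_default = re.findall(r"\w+$", text)                  # ['apple']

# multi-line mode: ^ and $ also match at each line break
start_multi = re.findall(r"^\w+", text, re.MULTILINE)    # ['apple', 'green']
end_multi = re.findall(r"\w+$", text, re.MULTILINE)      # ['pie', 'apple']

# \A and \Z ignore the flag
start_abs = re.findall(r"\A\w+", text, re.MULTILINE)     # ['apple']
end_abs = re.findall(r"\w+\Z", text, re.MULTILINE)       # ['apple']
```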

### Word boundaries
* reference a position, not an actual character
* before the first word character in the string
* after the last word character in the string
* between a word character and a non-word character
* word characters: `[A-Za-z0-9_]`
* any time we change from something in this character set to something that is not in this set, it's a word boundary
#### Examples:
* `/\b\w+\b/` finds four matches in "This is a test."
  + "This": there is a word boundary before T (start of string) and another before the following space, which is not a word character
  + "is": it has a space before and after it, and a space is not a word character
  + "a": it has a space before and after it, and a space is not a word character
  + "test": it has a space before it and a . after it, neither of which is a word character
* `/\b\w+\b/` matches all of "abc_123", but only part of "top-notch"  
* `/\bNEW\bYork\b/` does not match "NEW York": there is a word boundary between W and the space and another between the space and Y, but \b is zero-width, so nothing in the pattern matches the space itself
* `/\bNEW\b \bYork\b/` matches "NEW York"
* `/\B\w+\B/` finds two matches in "This is a test" ("hi" and "es")
* find every letter 'e' at the end of a word (every 'e' followed by a word boundary)
  + `/e\b/`
  + the e can be followed by a space, :, ?, etc. where the word ends
* find every standalone letter 'a', which may be separated from other letters by a space, ?, :, ; etc.
  + `/\ba\b/`
* find words that end with s
  + `/\b\w+s\b/` for "We picked apples"
    - the engine only attempts the rest of the pattern at positions where the leading \b holds
    - at positions inside a word, such as the start of "icked", there is no \b, so the engine moves straight on to the next character
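
The word boundary examples above, checked in Python with the same sample strings:

```python
import re

words = re.findall(r"\b\w+\b", "This is a test.")          # ['This', 'is', 'a', 'test']

# \B is the opposite: a position that is NOT a word boundary
inner = re.findall(r"\B\w+\B", "This is a test")           # ['hi', 'es']

# the hyphen is not a word character, the underscore is
split = re.findall(r"\b\w+\b", "top-notch abc_123")        # ['top', 'notch', 'abc_123']

# \b is zero-width, so the space has to be matched explicitly
no_space = re.search(r"\bNEW\bYork\b", "NEW York")         # None
with_space = re.search(r"\bNEW\b \bYork\b", "NEW York")    # a match

ends_in_s = re.findall(r"\b\w+s\b", "We picked apples")    # ['apples']
```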

#### Exercise for anchors and word boundaries
* apply global regular expression to the text "Self-Reliance"
* how many paragraphs start with "I" as in "I read"?
* how many paragraphs end with a question mark?
* match all words with 15 letters, including hyphenated words

#### answers
* `/^I\b/`  (multi-line mode)
* `/\?$/`  (multi-line mode; `\?` escapes the ? quantifier so that it matches a literal question mark)
* `/\b[\w\-]{15}\b/`

### Python re module
groups in the re module

In [4]:
import re
string = "John has 6 cats but I think my friend Susan has 3 dogs and Mike has 8 fish"
re.findall(r'[A-Za-z]+ \w+ \d+ \w+', string)

['John has 6 cats', 'Susan has 3 dogs', 'Mike has 8 fish']

In [5]:
# when the pattern contains a group, re.findall returns only the group contents
re.findall(r'([A-Za-z]+) \w+ \d+ \w+', string)

['John', 'Susan', 'Mike']

In [6]:
# with multiple groups, each match is returned as a tuple of the group contents
re.findall(r'([A-Za-z]+) \w+ (\d+) (\w+)', string)

[('John', '6', 'cats'), ('Susan', '3', 'dogs'), ('Mike', '8', 'fish')]

In [7]:
# use zip to transpose the per-match tuples into one tuple per group
info = re.findall(r'([A-Za-z]+) \w+ (\d+) (\w+)', string)
list(zip(*info))

[('John', 'Susan', 'Mike'), ('6', '3', '8'), ('cats', 'dogs', 'fish')]

In [21]:
# we can also wrap all the sub-groups in one outer group
# each tuple then holds the whole match followed by the inner groups
data = re.findall(r'(([A-Za-z]+) \w+ (\d+) (\w+))', string)
data

[('John has 6 cats', 'John', '6', 'cats'),
 ('Susan has 3 dogs', 'Susan', '3', 'dogs'),
 ('Mike has 8 fish', 'Mike', '8', 'fish')]

### using search in re module
* search returns the first matched instance

In [8]:
match = re.search(r'([A-Za-z]+) \w+ (\d+) (\w+)', string)
match

<re.Match object; span=(0, 15), match='John has 6 cats'>

In [9]:
# group(0) returns the entire matched instance
match.group(0)

'John has 6 cats'

In [10]:
# groups() returns all the groups of the matched instance as a tuple
match.groups()

('John', '6', 'cats')

In [11]:
# we can get each group content
print(match.group(1))
print(match.group(2))
print(match.group(3))

John
6
cats


In [12]:
match.group(1, 3)

('John', 'cats')

In [14]:
match.group(3, 2, 1, 1)

('cats', '6', 'John', 'John')

In [16]:
# match.span() gives the start and end position of the match
# with no arguments, span() defaults to group 0
match.span()

(0, 15)

In [18]:
# the start and end positions of group 1
match.span(1)

(0, 4)

In [19]:
# the start and end positions of group 2
match.span(2)

(9, 10)

In [22]:
# re.finditer returns an iterator of match objects
it = re.finditer(r'([A-Za-z]+) \w+ (\d+) (\w+)', string)

In [23]:
# each step of the iterator yields the next match object, continuing from where the previous match ended
next(it).group()

'John has 6 cats'

In [24]:
next(it).group(2)

'3'

In [26]:
it = re.finditer(r'([A-Za-z]+) \w+ (\d+) (\w+)', string)
for element in it:
    print(element.group(3, 1, 2))

('cats', 'John', '6')
('dogs', 'Susan', '3')
('fish', 'Mike', '8')


#### a repeated group in re.search() is overwritten, so you only get its last repetition
* the repeated group consumes as much text as possible (greedy mode)
* the group itself only holds the text of its last repetition
  + match.groups() returns only the last repetition of each group, not every repetition!

In [28]:
string = "ababababab"
match = re.search('(ab)+', string)
match

<re.Match object; span=(0, 10), match='ababababab'>

In [29]:
match.groups()

('ab',)

In [31]:
# the 1st (ab)+ consumes all the characters except the last 'ab' since it is greedy
match = re.search('(ab)+(ab)+', string)
match.groups()

('ab', 'ab')

In [33]:
# to confirm that the 1st (ab)+ consumes all the characters except the last 'ab'
# get the indices of the second group match
match.span(2)

(8, 10)

### Non-captured group
* when you want to use a group to find/search a pattern, but do not want to output/save the group
* syntax for non-capturing groups: `(?:...)`
* syntax for named groups: `(?P<name>...)`

In [43]:
string ="1234 56789"
re.findall(r'(\d)+', string)

['4', '9']

In [49]:
# re.findall returns the entire matched string since there is no capturing group to output
re.findall(r'(?:\d)+', string)

['1234', '56789']

In [50]:
# this is the same as non-group findall
re.findall(r'\d+', string)

['1234', '56789']

In [47]:
# if the pattern also contains a capturing group, re.findall outputs only that group
re.findall(r'(?:\d)+ (\d)+', string)

['9']

In [61]:
string ="123123 = Alex, 123123123 = Danny, 123123123123 = Mike, 456456 = rick, 1212 = John"

In [63]:
# we want to pull out all names whose id ends with 123
# the non-capturing group (?:\d*123) is only used to find the match
# the capturing group (\w+) is what we want to output
re.findall(r'(?:\d*123)+ = (\w+)', string)

['Alex', 'Danny', 'Mike']

In [65]:
string = "1*1*1*1*2222 1*1*3333 2*1*2*1*222 1*222*3333 3*3*3*444"
# no group output, so the output will be the matched string
re.findall(r'(?:1\*){2,}\d+', string)

['1*1*1*1*2222', '1*1*3333']

#### using re.search for non-capture group
* match.groups() returns an empty tuple, since the pattern has no capturing group

In [67]:
string ="1234 56789"
match = re.search(r'(?:\d)+', string)
match.groups()

()
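
Named groups are listed above but not demonstrated. A minimal sketch of Python's `(?P<name>...)` syntax follows; the group names `name`, `count` and `animal` are chosen only for this illustration:

```python
import re

pattern = r"(?P<name>[A-Za-z]+) \w+ (?P<count>\d+) (?P<animal>\w+)"
m = re.search(pattern, "John has 6 cats")

# a named group can be read by name or by its index
by_name = m.group("name")      # 'John'
by_index = m.group(1)          # 'John'

# groupdict() collects all named groups into a dictionary
as_dict = m.groupdict()
```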

### Backreferencing makes reference to the captured group in the same regular expression
* first define a group or groups
* refer to a group by its index: `\1`, `\2`, ...
* the reference matches the same text that the corresponding group captured

In [70]:
# here \1 refers to the content of group 1 (\w+)
re.search(r'(\w+) \1', "Merry Merry Christmas")

<re.Match object; span=(0, 11), match='Merry Merry'>

In [71]:
# we can also check the group it matched is Merry, and it is the only group
re.search(r'(\w+) \1', "Merry Merry Christmas").groups()

('Merry',)

In [72]:
re.findall(r'(\w+) \1', "Happy Happy Holidays. Merry Christmas Christmas")

['Happy', 'Christmas']

In [73]:
re.findall(r'(\w+) \1', "Merry Merry Christmas Christmas Merry Merry Christmas")

['Merry', 'Christmas', 'Merry']
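
One common use of backreferences is removing accidentally doubled words with `re.sub`. The sketch below adds word boundaries (`\b`), an addition not present in the cells above, so that the repeated text has to be a whole word:

```python
import re

# replace "word word" with a single "word"
deduped = re.sub(r"\b(\w+) \1\b", r"\1", "Merry Merry Christmas Christmas")   # 'Merry Christmas'

# without \b the backreference can match the start of a longer word
loose = re.search(r"(\w+) \1", "the theme").group()     # 'the the'
strict = re.search(r"\b(\w+) \1\b", "the theme")        # None
```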