# Meet Regular Expressions

## What?
- Regular expressions (called regexes or regex patterns) are a tiny language for dealing with text and character patterns.
- With RegEx patterns we can:
    - Check whether a string matches a pattern
    - Find a match for the pattern anywhere in the string
    - Modify and split strings in various ways
    
Python `re` library functions:
- `re.search` scans through a string, looking for the first location where the pattern matches.
- `re.findall` finds all non-overlapping substrings where the pattern matches and returns them as a list.
- `re.split` splits a string on a given regex pattern, removing the matched text. The result is a list of strings.
- `re.sub` matches a regex and substitutes a new substring for each match.
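
As a quick preview, here is a minimal sketch of the four functions side by side. The phone numbers and variable names are made up for illustration.

```python
import re

text = "Call 210-226-3232 or 512-555-0199 before 5pm."
phone_re = r"\d{3}-\d{3}-\d{4}"

# re.search returns a match object for the first match (or None)
first = re.search(phone_re, text).group()

# re.findall returns every non-overlapping match as a list of strings
numbers = re.findall(phone_re, text)

# re.split breaks the string apart wherever the pattern matches
parts = re.split(r"-", "210-226-3232")

# re.sub replaces every match with the replacement string
masked = re.sub(r"\d", "#", "210-226-3232")

print(first)    # 210-226-3232
print(numbers)  # ['210-226-3232', '512-555-0199']
print(parts)    # ['210', '226', '3232']
print(masked)   # ###-###-####
```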


## So What?
- Power + precision
    - Cost is learning something new and potentially unfamiliar.
    - Payoff is a language that works with any other programming language to operate on text and character patterns.
- Regular Expressions are cross platform and available in many programming languages and environments:
    - Command line tools (Linux, Windows, Mac, etc...)
    - Python
    - SQL flavors offer RegEx
    - Java (Scala/Clojure)
    - Other languages like Julia, Ruby, PHP, C#, etc...
    - Like SQL, there are differences between some of the different RegEx implementations, but if you know your RegEx, you can bring value in many environments.

## When is RegEx the right tool or wrong tool?
- If you can solve the problem with built-in string methods in your language, do so.
- If you need more capability than built-in string methods offer, reach for regex.
- If you're parsing HTML, JSON, or XML, use a tool built for those formats. Regex + html/json = don't
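
The three guidelines above might look like this in practice. This is only a sketch: the filename and JSON payload are invented for the example.

```python
import json
import re

filename = "report_2019.csv"

# 1. A built-in string method is enough for a simple suffix check
is_csv = filename.endswith(".csv")

# 2. Regex earns its keep when we need a pattern: four digits right before ".csv"
year_match = re.search(r"_(\d{4})\.csv$", filename)
year = year_match.group(1) if year_match else None

# 3. For structured formats, use a real parser instead of regex
payload = '{"name": "Jane", "tags": ["a", "b"]}'
record = json.loads(payload)

print(is_csv)          # True
print(year)            # 2019
print(record["tags"])  # ['a', 'b']
```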

## Now What?
- We'll start simple by writing regex patterns to match literal characters.
- Then we will introduce metacharacters, which have special meaning and functionality.

## Key Concepts
- The RegEx metacharacters `. ^ $ * + ? { } [ ] \ | ( )` have special meanings. 
- Square brackets create a "character class". 
    - Character classes allow us to specify many OR operations
    - For example, `r"[aeiou]"` matches any lowercase vowel character. Identical to `r"a|e|i|o|u"`
    - `r"[a-z]"` matches lowercase a through z.
- Most metacharacters are not active inside the character class square brackets `[]`. The exceptions are `^` as the first character (negation), `-` between two characters (range), `]`, and `\`.
- Outside of a character class, if you need to match a metacharacter literally, put a `\` in front of that character. For example, `r"\+"` matches the literal `+` character.
- RegEx has characters for special sequences:
    - `.` matches any single character (except a newline, by default)
    - `\d` matches any digit. Equivalent to `[0-9]` for ASCII text
    - `\D` matches any non-digit character and is equivalent to `[^0-9]`
    - `\s` matches any whitespace character, like a space, tab, carriage return, or newline
    - `\w` matches any alphanumeric character or underscore. Equivalent to `[0-9a-zA-Z_]` for ASCII text
    - `\W` matches any character that is not alphanumeric or underscore. Equivalent to `[^a-zA-Z0-9_]`
- Repetition:
    - `*` matches zero or more of the previous pattern
    - `+` matches 1 or more of the previous pattern
- `?` after a pattern means that pattern is optional
- Negation: `[^abc]` matches any single character except "a", "b", or "c"
- Anchors
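    - `^` start of the string
    - `$` end of the string
    - `\b` word boundary
- Groups
    - `(a)` groups a pattern so it can be repeated as a unit or captured for later use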
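
A few of these concepts in one place, as a minimal sketch. The subject strings are made up for illustration.

```python
import re

# Escaping: \+ matches a literal plus sign
plus_signs = re.findall(r"\+", "1+1=2")

# An unescaped . matches any character; an escaped \. matches only a literal dot
any_char = re.findall(r"a.c", "abc a.c a-c")
literal_dot = re.findall(r"a\.c", "abc a.c a-c")

# ? makes the preceding character optional
colors = re.findall(r"colou?r", "color colour")

# Anchors: ^ pins the match to the start, $ to the end
starts_with_do = re.search(r"^do", "do you like apples?") is not None
ends_with_question = re.search(r"\?$", "do you like apples?") is not None

# Groups: parentheses let + repeat "ha" as a unit
laugh = re.search(r"(ha)+", "hahaha!").group()

print(plus_signs)   # ['+']
print(any_char)     # ['abc', 'a.c', 'a-c']
print(literal_dot)  # ['a.c']
print(colors)       # ['color', 'colour']
print(laugh)        # hahaha
```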

## How Deep Does RegEx go?
- For challenging strings to match, like email addresses, we recommend using pre-built RegEx specifications, such as the one in the HTML specification at https://html.spec.whatwg.org/multipage/forms.html#valid-e-mail-address
- With known, good, and proven RegEx patterns like these, you don't need to reinvent things.
- ```r"^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$"```
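
One way the pattern above could be wrapped for reuse is sketched below. The helper name `is_valid_email` and the candidate strings are assumptions for illustration, not part of the specification.

```python
import re

# The pattern from the HTML specification, split across two lines for readability
EMAIL_RE = re.compile(
    r"^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
    r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$"
)


def is_valid_email(candidate):
    """Hypothetical helper: True if the candidate matches the pattern."""
    return EMAIL_RE.search(candidate) is not None


candidates = [
    "jane@company.com",
    "jane.janeway@ang.af.mil",
    "no-at-sign.com",
    "two@@signs.com",
]
results = {candidate: is_valid_email(candidate) for candidate in candidates}
print(results)
```

Note that a pattern like this checks only the shape of the string, not whether the mailbox exists.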


In [2]:
import re
import pandas as pd

### Patterns to Match Literals 
> Crawl before you walk

In [2]:
string = "Two households, both alike in dignity, In fair Verona, where we lay our scene, From ancient grudge break to new mutiny, Where civil blood makes civil hands unclean."
string

'Two households, both alike in dignity, In fair Verona, where we lay our scene, From ancient grudge break to new mutiny, Where civil blood makes civil hands unclean.'

In [3]:
# We can search for a literal match of the string Verona
# re.search(r"pattern", "our subject")
x = re.search(r"Verona", string)
x

<re.Match object; span=(47, 53), match='Verona'>

In [4]:
# the span returned holds the start and end indexes of the match
# Consider what we get if we slice the string using the span bounds
string[47:53]

'Verona'

In [5]:
re.search(r"In fair Verona", string)

<re.Match object; span=(39, 53), match='In fair Verona'>

In [6]:
# The string "Leonardo DiCaprio" is not here, so re.search returns None
re.search(r"Leonardo DiCaprio", string)

In [7]:
# re.search returns the first match
re.search(r"civil", string)

<re.Match object; span=(126, 131), match='civil'>

In [8]:
# .findall returns all matches
re.findall(r"civil", string)

['civil', 'civil']

In [9]:
# .findall returns an empty list when there are no matches
re.findall(r"Claire Danes", string)

[]

In [10]:
re.search(r"Two", string)

<re.Match object; span=(0, 3), match='Two'>

In [11]:
# Are computers particular about specifics?
# Matching is case-sensitive by default, so this returns None
re.search(r"two", string)

In [12]:
# The re.IGNORECASE flag does exactly that
re.search(r"two", string, re.IGNORECASE)

<re.Match object; span=(0, 3), match='Two'>

In [13]:
re.search(r"A", "aaaaaa", re.IGNORECASE)

<re.Match object; span=(0, 1), match='a'>

In [14]:
re.search(r"Aaaaa", "aaaaaa", re.IGNORECASE)

<re.Match object; span=(0, 5), match='aaaaa'>

## Using `|` for a logical OR to open opportunities
- We can use `|` with literal characters or other regular expression patterns

In [15]:
# OR
# Findall returns all matches 
re.findall(r"gray|grey", "I can't remember if you spell grey gray or gray like grey!")

['grey', 'gray', 'gray', 'grey']

In [16]:
# The .search method returns only the first match found in the subject string
re.search(r"orange|apple", "I like both apples and oranges")

<re.Match object; span=(12, 17), match='apple'>

In [17]:
re.findall(r"this|that", "this that and the other")

['this', 'that']

In [18]:
# has a vowel, anywhere
re.search(r"a|e|i|o|u", "banana", re.IGNORECASE)

<re.Match object; span=(1, 2), match='a'>

In [19]:
re.findall(r"[aeiou]", "banana", re.IGNORECASE)

['a', 'a', 'a']

In [20]:
re.findall(r"a|e|i|o|u", "banana", re.IGNORECASE)

['a', 'a', 'a']

In [21]:
# caret (^) is starts-with
# . is any character
# * is zero or more
re.search(r"^b.*", "bananarama")

<re.Match object; span=(0, 10), match='bananarama'>

In [22]:
# .* finds the largest possible match
# technical term is greedy
re.search(r"^b.*", "bananarama pajama")

<re.Match object; span=(0, 17), match='bananarama pajama'>
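
To see greediness more directly, it can help to compare `*` with its lazy counterpart `*?`, which matches as few characters as possible. A small sketch with made-up subject strings:

```python
import re

subject = "bananarama pajama"

# Greedy: .* grabs as much as it can, then backs up to the last "a"
greedy = re.search(r"b.*a", subject).group()

# Lazy: .*? grabs as little as it can, stopping at the first "a"
lazy = re.search(r"b.*?a", subject).group()

# The difference matters when pulling out quoted pieces of text
quoted_greedy = re.findall(r'"(.*)"', 'say "hi" and "bye"')
quoted_lazy = re.findall(r'"(.*?)"', 'say "hi" and "bye"')

print(greedy)         # bananarama pajama
print(lazy)           # ba
print(quoted_greedy)  # ['hi" and "bye']
print(quoted_lazy)    # ['hi', 'bye']
```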

In [23]:
# match b then 1 or more alphanumerics for a word
# \w means any a-zA-Z0-9_
# + means 1 or more of the preceding \w
# when the pattern hits the " " before pajama, we're done
re.search(r"^b\w+", "bananarama pajama")

<re.Match object; span=(0, 10), match='bananarama'>

In [24]:
# match the character b then any other character
re.search(r"b.", "hello bananarama pajama")

<re.Match object; span=(6, 8), match='ba'>

In [25]:
# match b followed by 3 of any character
re.search(r"b.{3}", "hello bananarama pajama")

<re.Match object; span=(6, 10), match='bana'>

In [26]:
re.search(r"b.* ", "hello bananarama pajama")

<re.Match object; span=(6, 17), match='bananarama '>

In [27]:
# [^abc] as "anything that ain't a or b or c"
re.search(r"[^b]", "hello bananarama pajama")

<re.Match object; span=(0, 1), match='h'>

In [28]:
# let's find something that starts with a then has any number of other characters
re.search(r"^a.*", "hello bananarama pajama")

In [29]:
re.search(r"ban.*", "hello bananarama pajama")

<re.Match object; span=(6, 23), match='bananarama pajama'>

In [30]:
re.search(r"a.*", "hello bananarama pajama")

<re.Match object; span=(7, 23), match='ananarama pajama'>

In [31]:
# starts with
# anything
# ends with 
re.search(r"^b.* ", "bananarama pajama")

<re.Match object; span=(0, 11), match='bananarama '>

In [32]:
# starts with b
# then one or more word characters
# stops at the first non-word character (the space)
re.search(r"^b\w+", "bananarama pajama")

<re.Match object; span=(0, 10), match='bananarama'>

In [33]:
re.search(r".*jama$", "bananarama pajama")

<re.Match object; span=(0, 17), match='bananarama pajama'>

In [34]:
re.search(r".*rama", "bananarama pajama")

<re.Match object; span=(0, 10), match='bananarama'>

In [35]:
# \w matches [a-zA-Z0-9_]
re.search(r"\w", "abc123")

<re.Match object; span=(0, 1), match='a'>

In [36]:
# what if we want only letters and not letters + numbers + _ character
re.search(r"[a-zA-Z]*", "stuff and things and 123")

<re.Match object; span=(0, 5), match='stuff'>

In [37]:
# what if we want only letters and not letters + numbers + _ character
# the [a-zA-Z]+ finds every run of characters made up only of letters
re.findall(r"[a-zA-Z]+", "42 $stuff a****nd things and 123")

['stuff', 'a', 'nd', 'things', 'and']

In [38]:
re.search(r"\w\w\w", "abc123")

<re.Match object; span=(0, 3), match='abc'>

In [39]:
re.search(r"\w\w\w\w\w\w", "abc123")

<re.Match object; span=(0, 6), match='abc123'>

In [40]:
# seven \w in a row need seven word characters; "abc123" has only six, so there is no match
re.search(r"\w\w\w\w\w\w\w", "abc123")

In [41]:
re.search(r"\w*", "abc123def456")

<re.Match object; span=(0, 12), match='abc123def456'>

In [42]:
# curly braces for repetition
re.search(r"\w{3}", "abc123")

<re.Match object; span=(0, 3), match='abc'>

In [43]:
re.search(r"\w{1,6}", "abc123")

<re.Match object; span=(0, 6), match='abc123'>

In [44]:
# {n,} matches n or more times
re.search(r"\w{1,}", "abc123 is the place to be")

<re.Match object; span=(0, 6), match='abc123'>

In [45]:
# {n,} matches n or more times
re.findall(r"\w{1,}", "abc123 is the place to be")

['abc123', 'is', 'the', 'place', 'to', 'be']

In [46]:
# {n,m} matches between n and m times
re.findall(r"\w{1,6}", "abc123 is the place to be banaramapajama")

['abc123', 'is', 'the', 'place', 'to', 'be', 'banara', 'mapaja', 'ma']

In [47]:
# {n,m} matches between n and m times
# the space after \w{1,6} means each match must be followed by a space
re.findall(r"\w{1,6} ", "abc123 is the place to be banaramapajama")

['abc123 ', 'is ', 'the ', 'place ', 'to ', 'be ']

In [48]:
# {n,m} matches between n and m times
re.search(r"\w{1,6}", "abc123 is the place to be banaramapajama")

<re.Match object; span=(0, 6), match='abc123'>

In [49]:
# {n,m} matches between n and m times
re.search(r"\w{2,6}", "abc123")

<re.Match object; span=(0, 6), match='abc123'>

In [50]:
# r"\w+" is the same as r"\w{1,}"
re.findall(r"\w+", "abc123 is the place to be")

['abc123', 'is', 'the', 'place', 'to', 'be']

In [51]:
# 3 digits then a single character of any then 4 digits
re.search(r"[0-9]{3}.[0-9]{4}", "226-3232")

<re.Match object; span=(0, 8), match='226-3232'>

In [52]:
# 3 digits then a single character of any then 4 digits
re.search(r"[0-9]{3}.[0-9]{4}", "226.3232")

<re.Match object; span=(0, 8), match='226.3232'>

In [53]:
# What if the delimiter is missing?
# without a ?, the . still requires one delimiter character, so there is no match
re.search(r"[0-9]{3}.[0-9]{4}", "2263232")


In [54]:
# What if the delimiter is optional?
# question mark metacharacter means the thing to the left of the ? is optional
re.search(r"[0-9]{3}.?[0-9]{4}", "2263232")


<re.Match object; span=(0, 7), match='2263232'>

In [55]:
re.search(r"[0-9]{3}.?[0-9]{4}", "226-3232")


<re.Match object; span=(0, 8), match='226-3232'>

In [56]:
re.search(r"[0-9]{3}.?[0-9]{4}", "2263232")

<re.Match object; span=(0, 7), match='2263232'>
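
One caveat with `.?` is that the dot accepts any character as the delimiter. If only a dash or a period should count, a character class is a tighter choice. A sketch for comparison, where the names `loose` and `strict` are just for this example:

```python
import re

loose = r"[0-9]{3}.?[0-9]{4}"      # any single character may sit in the middle
strict = r"[0-9]{3}[-.]?[0-9]{4}"  # only "-" or "." may sit in the middle

# The loose pattern happily accepts an "x" as the delimiter
loose_hit = re.search(loose, "226x3232")

# The strict pattern rejects it
strict_miss = re.search(strict, "226x3232")

# Both patterns accept the formats we actually want
strict_hits = [re.search(strict, s).group() for s in ["226-3232", "226.3232", "2263232"]]

print(loose_hit.group())  # 226x3232
print(strict_miss)        # None
print(strict_hits)        # ['226-3232', '226.3232', '2263232']
```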

## Using a RegEx pattern to split a string
- The `re.split` method returns a list of strings
- The matching substring is removed
- We can split on any regex pattern, not only character literals

In [57]:
"210-226-3232".split("-")

['210', '226', '3232']

In [58]:
# Split the phone number on either a dash or a space
re.split(r"-| ", "210 226 3232")

['210', '226', '3232']

In [59]:
# Split the phone number on either a dash or a space
re.split(r"-| ", "210-226-3232")

['210', '226', '3232']

In [60]:
# Splits the string on the space character
# A plain space in the pattern matches a literal space, so no escaping is needed
re.split(r" ", "this that and the other")

['this', 'that', 'and', 'the', 'other']
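
Because `re.split` accepts any pattern, one call can handle several delimiters or runs of whitespace. A small sketch with made-up inputs:

```python
import re

# A character class splits on a dash, a period, or a space
numbers = ["210-226-3232", "210 226 3232", "210.226.3232"]
split_numbers = [re.split(r"[-. ]", number) for number in numbers]

# \s+ splits on any run of whitespace: spaces, tabs, or newlines
words = re.split(r"\s+", "this   that\tand the\nother")

print(split_numbers)
print(words)  # ['this', 'that', 'and', 'the', 'other']
```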

In [61]:
# Parse these songs into a dataframe containing 2 columns: artist_name and song_name
# Hint: break the string into an array of strings that hold each song/artist record
songs = "Harry_Belafonte_-_Jump_In_the_Line.mp3,Willie_Mae_'Big_Mama'_Thornton_-_Hound_Dog.mp3,Tina_Turner_-_Proud_Mary.mp3,Prince_-_Purple_Rain.mp3"
songs

"Harry_Belafonte_-_Jump_In_the_Line.mp3,Willie_Mae_'Big_Mama'_Thornton_-_Hound_Dog.mp3,Tina_Turner_-_Proud_Mary.mp3,Prince_-_Purple_Rain.mp3"

In [62]:
# re.method(pattern, subject_string, re.IGNORECASE)

# re.split
# re.search
# re.findall
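
One possible approach to the songs exercise is sketched below, assuming every record follows the `Artist_-_Song.mp3` form. The names `record_re` and `rows` are made up for this example. It builds a plain list of dictionaries, which could then be handed to `pd.DataFrame(rows)`.

```python
import re

songs = "Harry_Belafonte_-_Jump_In_the_Line.mp3,Willie_Mae_'Big_Mama'_Thornton_-_Hound_Dog.mp3,Tina_Turner_-_Proud_Mary.mp3,Prince_-_Purple_Rain.mp3"

# Artist, then the "_-_" separator, then the song, then the .mp3 extension
record_re = re.compile(r"^(?P<artist_name>.+)_-_(?P<song_name>.+)\.mp3$")

rows = []
for record in re.split(r",", songs):
    match = record_re.search(record)
    if match is None:
        continue  # skip anything that does not follow the expected form
    # Swap the underscores for spaces in each captured group
    rows.append({name: re.sub(r"_", " ", value) for name, value in match.groupdict().items()})

for row in rows:
    print(row)
```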

## [Character Classes]
- Square brackets make character classes 
- Character classes provide OR behavior
- In a character class, a leading `^` works as a "none of" operator
- Most metacharacters match their literal character inside the square brackets of a character class (the exceptions are a leading `^`, a `-` between two characters, `]`, and `\`)
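
A short sketch of these rules, using made-up subject strings:

```python
import re

# Inside [], the metacharacters . + * match themselves
literal_meta = re.findall(r"[.+*]", "1+1=2. 3*3=9.")

# A leading ^ negates the class: anything that is not a vowel or whitespace
not_vowels = re.findall(r"[^aeiou\s]", "hello you")

# A ^ that is not first is just a literal caret
caret = re.findall(r"[a^]", "a^b")

# A - between two characters makes a range; at the end it is a literal dash
in_range = re.findall(r"[a-c]", "abcd-")
literal_dash = re.findall(r"[ac-]", "abcd-")

print(literal_meta)  # ['+', '.', '*', '.']
print(not_vowels)    # ['h', 'l', 'l', 'y']
print(caret)         # ['a', '^']
print(in_range)      # ['a', 'b', 'c']
print(literal_dash)  # ['a', 'c', '-']
```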

In [63]:
# has a vowel, anywhere
re.search(r"[aeiou]", "banana", re.IGNORECASE)

<re.Match object; span=(1, 2), match='a'>

In [64]:
# The square brackets around "ae" match either an a or an e in that position
re.findall(r"gr[ae]y", "Some people spell gray like grey")

['gray', 'grey']

In [65]:
# has a vowel, anywhere

re.search(r"a|e|i|o|u", "banana", re.IGNORECASE)

<re.Match object; span=(1, 2), match='a'>

In [66]:
# Is only a single vowel

re.search(r"^[aeiou]{1}$", "a", re.IGNORECASE)

<re.Match object; span=(0, 1), match='a'>

In [67]:
# is only a single vowel
re.search(r"^[aeiou]{1}$", "ae", re.IGNORECASE)

In [68]:
# is only vowels
re.search(r"^[aeiou]*$", "aaeeeaa")

<re.Match object; span=(0, 7), match='aaeeeaa'>

In [69]:
# has a p or q, anywhere
re.search(r"p|q", "albuquerque", re.IGNORECASE)

<re.Match object; span=(4, 5), match='q'>

In [70]:
# has a p or q, anywhere
re.search(r"[pq]", "albuquerque", re.IGNORECASE)

<re.Match object; span=(4, 5), match='q'>

In [71]:
# is p or q
re.search(r"^[pq]{1}$", "q", re.IGNORECASE)

<re.Match object; span=(0, 1), match='q'>

In [72]:
# is only Ps and Qs
re.search(r"^[pqPQ]*$", "pqpqpqpPQQQQQQQQp")

<re.Match object; span=(0, 17), match='pqpqpqpPQQQQQQQQp'>

In [73]:
re.search(r"^[pq]*$", "b3qwpeop")

In [74]:
string = "Two households, both alike in dignity, In fair Verona, where we lay our scene, From ancient grudge break to new mutiny, Where civil blood makes civil hands unclean."
string

# find all the occurrences of civil followed by the word immediately after "civil"
re.findall(r"civil\s[a-z]+", string)


['civil blood', 'civil hands']

## Repetition characters and Special Sequences
> Walk before you run

- `.` matches any single character (except a newline, by default)
- `*` means zero or more of the preceding pattern
- `+` means one or more of the preceding pattern
- `\b` is a word boundary anchor
- `\d` matches any decimal digit. Equivalent to `[0-9]` for ASCII text
- `\D` matches any non-digit character and is equivalent to `[^0-9]`
- `\s` matches any whitespace character, like a space, tab, carriage return, or newline
- `\w` matches any alphanumeric character or underscore. Equivalent to `[0-9a-zA-Z_]` for ASCII text
- `\W` matches any character that is not alphanumeric or underscore. Equivalent to `[^a-zA-Z0-9_]`
- `{n}` exactly n repetitions of the preceding pattern
- `{n,}` n or more repetitions
- `{n,m}` n to m repetitions (written without a space inside the braces)
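
A few of the special sequences above in action. This is a minimal sketch and the subject strings are invented for illustration.

```python
import re

subject = "Room 42, Floor 7: call 555-0199"

# \d+ finds each run of digits
digit_runs = re.findall(r"\d+", subject)

# \D+ finds each run of non-digits
non_digit_runs = re.findall(r"\D+", "a1b22c")

# \W finds each character that is not a letter, digit, or underscore
non_word = re.findall(r"\W", "hi, there!")

# \s+ matches any run of whitespace, including tabs and newlines
pieces = re.split(r"\s+", "tab\tand\nnewline  spaces")

# {4} with \b on both sides: runs of exactly four digits
four_digits = re.findall(r"\b\d{4}\b", subject)

# {2,} means two or more digits
two_or_more = re.findall(r"\d{2,}", subject)

print(digit_runs)      # ['42', '7', '555', '0199']
print(non_digit_runs)  # ['a', 'b', 'c']
print(non_word)        # [',', ' ', '!']
print(pieces)          # ['tab', 'and', 'newline', 'spaces']
print(four_digits)     # ['0199']
print(two_or_more)     # ['42', '555', '0199']
```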

In [75]:
# without the \b word boundary, the match can start in the middle of a word
re.search(r"o\w+", "do you like apples or oranges?")

<re.Match object; span=(4, 6), match='ou'>

In [76]:
# \b means word boundary
# any word that starts with o
re.search(r"\bo\w+", "do you like apples or oranges?")

<re.Match object; span=(19, 21), match='or'>

In [77]:
# \b means word boundary
# any word that starts with o
re.findall(r"\bo\w+", "do you like apples or oranges?")

['or', 'oranges']

In [78]:
# \b means word boundary
# any word that starts with o
# without the word boundary, we get the "ou" from "you"
re.findall(r"o\w+", "do you like apples or oranges?")

['ou', 'or', 'oranges']

In [79]:
# \b means word boundary
# any word that starts with o
re.findall(r"\bo\w+", "do you like apples or oranges?")

['or', 'oranges']

## Groups

In [80]:
sentence = '''
You can find us on the web at https://codeup.com. Our ip address is 123.123.123.123 (maybe).
'''.strip()
sentence

'You can find us on the web at https://codeup.com. Our ip address is 123.123.123.123 (maybe).'

In [81]:
ip_re = r'\d+(\.\d+){3}'

match = re.search(ip_re, sentence)
match[0]

'123.123.123.123'
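
One thing to watch for: when a pattern contains a capturing group, `re.findall` returns the group's text rather than the whole match. A non-capturing group, written `(?:...)`, groups without capturing. A short sketch with a shortened version of the sentence above:

```python
import re

sentence = "Our ip address is 123.123.123.123 (maybe)."

# With a capturing group, findall returns only what the group captured last
capturing = re.findall(r"\d+(\.\d+){3}", sentence)

# With a non-capturing group, findall returns the whole match
non_capturing = re.findall(r"\d+(?:\.\d+){3}", sentence)

print(capturing)      # ['.123']
print(non_capturing)  # ['123.123.123.123']
```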

In [82]:
# simplified for demonstration; a real regex for parsing URLs would be much more
# complex
url_re = r'(https?)://(\w+)\.(\w+)'

protocol, domain, tld = re.search(url_re, sentence).groups()

print(f'''
protocol: {protocol}
domain:   {domain}
tld:      {tld}
''')


protocol: https
domain:   codeup
tld:      com



In [83]:
re.search(url_re, sentence).groups()

('https', 'codeup', 'com')

In [84]:
url_re = r'(?P<protocol>https?)://(?P<domain>\w+)\.(?P<tld>\w+)'

match = re.search(url_re, sentence)


In [85]:
match.groups()

('https', 'codeup', 'com')

In [86]:
match.group("domain")

'codeup'

In [87]:
match.group("tld")

'com'

In [88]:
match.groupdict()

{'protocol': 'https', 'domain': 'codeup', 'tld': 'com'}

In [89]:
print(f'''
groups: {match.groups()}
referencing a group by name: {match.group('tld')}
group dictionary: {match.groupdict()}
''')


groups: ('https', 'codeup', 'com')
referencing a group by name: com
group dictionary: {'protocol': 'https', 'domain': 'codeup', 'tld': 'com'}



## A Reflection on Captured Groups
- After matching the first group or two, the need to be _highly_ specific with subsequent groups likely starts to decrease (unless the forms in the source text are ambiguous).
- For example, matching an arbitrary user agent string on its own might prove challenging, and matching specific user agents even more so (see https://jonlabelle.com/snippets/view/yaml/browser-user-agent-regular-expressions). But suppose that user agent string lives in a log line where, to its left, we've already matched groups for the method type (`GET|POST`) and the timestamp. Then a perfectly clean user agent regex matters less than simply matching any string up to, but not including, the bytes transmitted group to its right.
- We can sometimes rely on the regularity of forms, especially with multiple captured groups, to capture much more challenging patterns that sit between other, more easily discerned groups.

In [3]:
# Consider the following lines in a log
logs = """
GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58
POST /users_accounts/file-upload [16/Apr/2019:193452+0000] HTTP/1.1 {201} 42 "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" 97.105.19.58
GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58
"""
print(logs)


GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58
POST /users_accounts/file-upload [16/Apr/2019:193452+0000] HTTP/1.1 {201} 42 "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" 97.105.19.58
GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58
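
A sketch of how named groups might parse these lines is shown below. The group names and the `log_re` variable are assumptions made for this example. In these sample lines the user agent sits inside double quotes between the byte count and the IP address, so a loose `.*` between the quotes is enough once the neighboring groups are pinned down.

```python
import re

logs = """
GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58
POST /users_accounts/file-upload [16/Apr/2019:193452+0000] HTTP/1.1 {201} 42 "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" 97.105.19.58
GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58
"""

log_re = re.compile(r"""
^(?P<method>GET|POST)\s+          # request method
(?P<path>\S+)\s+                  # request path, up to the next space
\[(?P<timestamp>[^\]]+)\]\s+      # everything inside the square brackets
(?P<protocol>\S+)\s+              # e.g. HTTP/1.1
\{(?P<status>\d+)\}\s+            # status code inside curly braces
(?P<bytes_sent>\d+)\s+            # bytes transmitted
"(?P<user_agent>.*)"\s+           # anything between the double quotes
(?P<ip>\d+(?:\.\d+){3})$          # four dot-separated numbers
""", re.VERBOSE)

parsed = []
for line in logs.strip().splitlines():
    match = log_re.search(line)
    if match is not None:
        parsed.append(match.groupdict())

for row in parsed:
    print(row["method"], row["status"], row["user_agent"])
```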



In [99]:
emails = [
    "jane@company.com",
    "bob@company.com",
    "jane.janeway@company.com",
    "jane.janeway@dogood.org",
#     "jane.janet.janeway@dogood.org",
#     "jane.janeway@ang.af.mil",
]

In [112]:
pattern = re.compile(r"""
(?P<first_name>\w+)?
\.?
(?P<last_name>\w+)?
\@
(?P<domain>\w+)
\.
(?P<tld>\w+)
""", re.VERBOSE)

In [113]:
[re.search(pattern, email).groupdict() for email in emails]

[{'first_name': 'jane', 'last_name': None, 'domain': 'company', 'tld': 'com'},
 {'first_name': 'bob', 'last_name': None, 'domain': 'company', 'tld': 'com'},
 {'first_name': 'jane',
  'last_name': 'janeway',
  'domain': 'company',
  'tld': 'com'},
 {'first_name': 'jane',
  'last_name': 'janeway',
  'domain': 'dogood',
  'tld': 'org'}]