# Meet Regular Expressions

## What?
- Regular expressions (called regexes or regex patterns) are a tiny language for dealing with text and character patterns.
- With RegEx patterns we can:
    - Check whether a string matches a pattern
    - Find a match for the pattern anywhere in the string
    - Modify and split strings in various ways
    
Python `re` library functions:
- `re.search` scans through a string, looking for the first location where the RE matches.
- `re.findall` finds all non-overlapping substrings where the RE matches and returns them as a list.
- `re.split` splits a string on a given regex pattern, removing that pattern. The result is a list of strings.
- `re.sub` lets us match a regex and substitute a new substring for each match.
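
To make the four functions concrete, here is a minimal sketch that runs each one against the same text. The sample sentence and variable names are invented for illustration.

```python
import re

text = "cats, dogs, and 3 birds"  # hypothetical sample text

first_word = re.search(r"[a-z]+", text)   # first match only, returned as a Match object
all_words = re.findall(r"[a-z]+", text)   # every non-overlapping match, returned as a list
pieces = re.split(r",\s*", text)          # split on a comma plus any following whitespace
masked = re.sub(r"[0-9]", "#", text)      # replace each digit with '#'

print(first_word.group())  # cats
print(all_words)           # ['cats', 'dogs', 'and', 'birds']
print(pieces)              # ['cats', 'dogs', 'and 3 birds']
print(masked)              # cats, dogs, and # birds
```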


## So What?
- Power + precision
    - Cost is learning something new and potentially unfamiliar.
    - Payoff is a language that works with any other programming language to operate on text and character patterns.
- Regular Expressions are cross platform and available in many programming languages and environments:
    - Command line tools (Linux, Windows, Mac, etc...)
    - Python
    - SQL flavors offer RegEx
    - Java (Scala/Clojure)
    - Other languages like Julia, Ruby, PHP, C#, etc...
    - Like SQL, there are differences between some of the different RegEx implementations, but if you know your RegEx, you can bring value in many environments.

## When is RegEx the right tool or wrong tool?
- If you can solve the problem with built-in string methods in your language, do so.
- If you need more capability than built-in string methods offer, regex is a good fit.
- If you're parsing HTML, JSON, or XML, use a tool built for those formats. Regex + html/json = don't
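
A rough sketch of the three cases above, assuming a made-up log line and a made-up JSON document. A fixed substring needs only a string method, variable content is where regex earns its keep, and structured data goes through a real parser (here the standard-library `json` module).

```python
import json
import re

line = "status=ok; retries=3"  # hypothetical log line

# A fixed substring check needs no regex.
has_ok = "status=ok" in line

# Variable content (one or more digits) is a better fit for a pattern.
retry_count = int(re.search(r"[0-9]+", line).group())

# Structured formats should go through a parser built for them.
record = json.loads('{"status": "ok", "retries": 3}')

print(has_ok, retry_count, record["retries"])
```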

## Now What?
- We'll start simple by writing regex patterns to match literal characters.
- Then we will introduce metacharacters, which have special meaning and functionality.

## Key Concepts
- The RegEx metacharacters `. ^ $ * + ? { } [ ] \ | ( )` have special meanings. 
- Square brackets create a "character class". 
    - Character classes allow us to specify many OR operations
    - For example, `r"[aeiou]"` matches any lowercase vowel character. Identical to `r"a|e|i|o|u"`
    - `r"[a-z]"` matches lowercase a through z.
- Most metacharacters are not active inside of the character class square brackets `[]`. The exceptions are `\`, a leading `^` (negation), a `-` between two characters (range), and `]`.
- Outside of the character class `[]`, if you need to match a metacharacter literally, you will need to put a `\` in front of that character. `r"\+"` will match the literal `+` character.
- RegEx has a wildcard and several special sequences:
    - `.` matches any character except a newline (by default)
    - `\d` matches any digit. For ASCII text it is equivalent to `[0-9]`
    - `\D` matches any non-digit character and is equivalent to `[^0-9]`.
    - `\s` matches any whitespace character, such as a space, tab, carriage return, or newline
    - `\w` matches any alphanumeric character or underscore. For ASCII text it is equivalent to `[0-9a-zA-Z_]`
    - `\W` matches any character that is not alphanumeric or an underscore. Equivalent to `[^a-zA-Z0-9_]`
- Repetition:
    - `*` matches zero or more of the previous pattern
    - `+` matches 1 or more of the previous pattern
- `?` after a pattern means that pattern is optional
- Not - `[^abc]` matches anything but "a" or "b" or "c"
- Anchors
    - `^` start of the string
    - `$` end of the string
    - `\b` word boundary
- Groups
    - `(a)` groups the pattern inside the parentheses and captures what it matched
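
A few of the concepts above can be sketched in code. The sample strings and variable names are made up for illustration; the pattern syntax is standard Python `re`.

```python
import re

# Escaping: the backslash makes + a literal plus sign instead of "one or more".
escaped = re.search(r"1\+1", "1+1=2").group()

# Anchors: ^ and $ pin the pattern to the start and end of the string.
exact = re.search(r"^cat$", "cat")
not_exact = re.search(r"^cat$", "concat")

# \b is a word boundary, so the "cat" inside "concat" is skipped.
whole_words = re.findall(r"\bcat\b", "cat concat cat")

# Groups: parentheses capture parts of the match for later use.
phone = re.search(r"([0-9]{3})-([0-9]{4})", "call 226-3232")

print(escaped)                         # 1+1
print(exact is not None, not_exact)    # True None
print(whole_words)                     # ['cat', 'cat']
print(phone.group(1), phone.group(2))  # 226 3232
```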

## How Deep Does RegEx go?
- For strings that are challenging to match, like email addresses, we recommend using pre-built RegEx specifications such as the one in the HTML specification at https://html.spec.whatwg.org/multipage/forms.html#valid-e-mail-address
- With known, good, and proven RegEx patterns like these, you don't need to reinvent things.
- ```r"^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$"```
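
As a sketch of how a proven pattern like this might be used, the snippet below compiles the pattern from the bullet above and wraps it in a small helper. The helper name `looks_like_email` and the sample addresses are hypothetical.

```python
import re

# Pattern copied from the bullet above, split across two lines for readability.
EMAIL_PATTERN = re.compile(
    r"^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
    r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$"
)


def looks_like_email(candidate):
    """Hypothetical helper: True when the candidate matches the pattern."""
    return EMAIL_PATTERN.search(candidate) is not None


print(looks_like_email("romeo@verona.example"))  # True
print(looks_like_email("romeo at verona"))       # False
```

Compiling the pattern once with `re.compile` keeps the long expression in a single named place, which is easier to reuse than repeating the raw string.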


In [1]:
import re

### Patterns to Match Literals 
> Crawl before you walk

In [2]:
string = "Two households, both alike in dignity, In fair Verona, where we lay our scene, From ancient grudge break to new mutiny, Where civil blood makes civil hands unclean."
string

'Two households, both alike in dignity, In fair Verona, where we lay our scene, From ancient grudge break to new mutiny, Where civil blood makes civil hands unclean.'

In [5]:
# We can search for a literal match of the string Verona
# re.search(r"pattern", "our subject")
x = re.search(r"Verona", string)
x

<re.Match object; span=(47, 53), match='Verona'>

In [6]:
# The span gives the start and end indexes of the match.
# Slicing the string with the span bounds returns the matched text
string[47:53]

'Verona'
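
The Match object also exposes the span and the matched text directly, so the slice does not have to be typed by hand. A minimal sketch on a shortened copy of the sample text:

```python
import re

string = "In fair Verona, where we lay our scene"  # shortened sample text
match = re.search(r"Verona", string)

start, end = match.span()   # the same two numbers shown in the Match repr
print(start, end)           # 8 14
print(string[start:end])    # Verona
print(match.group())        # Verona, without slicing
```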

In [7]:
re.search(r"In fair Verona", string)

<re.Match object; span=(39, 53), match='In fair Verona'>

In [8]:
# The string "Leonardo DiCaprio" is not here, so re.search returns None
re.search(r"Leonardo DiCaprio", string)
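
A notebook shows nothing for a `None` result, which is why the cell above has no output. In a script, the usual approach is to check for `None` before using the match. A minimal sketch, with a shortened copy of the sample text:

```python
import re

string = "Two households, both alike in dignity"  # shortened sample text

match = re.search(r"Leonardo DiCaprio", string)

# Calling match.group() on None would raise AttributeError, so check first.
if match is None:
    result = "no match"
else:
    result = match.group()

print(result)  # no match
```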

In [9]:
# re.search returns the first match
re.search(r"civil", string)

<re.Match object; span=(126, 131), match='civil'>

In [10]:
# .findall returns all matches
re.findall(r"civil", string)

['civil', 'civil']

In [11]:
# .findall returns an empty list when there are no matches
re.findall(r"Claire Danes", string)

[]

In [12]:
re.search(r"Two", string)

<re.Match object; span=(0, 3), match='Two'>

In [13]:
# Are computers particular about specifics? Matching is case-sensitive, so this returns None
re.search(r"two", string)

In [14]:
# The re.IGNORECASE flag does exactly that
re.search(r"two", string, re.IGNORECASE)

<re.Match object; span=(0, 3), match='Two'>

In [15]:
re.search(r"A", "aaaaaa", re.IGNORECASE)

<re.Match object; span=(0, 1), match='a'>

In [16]:
re.search(r"Aaaaa", "aaaaaa", re.IGNORECASE)

<re.Match object; span=(0, 5), match='aaaaa'>

## Using `|` for a logical OR to open opportunities
- We can use `|` with literal characters or other regular expression patterns

In [17]:
# OR
# Findall returns all matches 
re.findall(r"gray|grey", "I can't remember if you spell grey gray or gray like grey!")

['grey', 'gray', 'gray', 'grey']

In [18]:
# re.search returns only the first match
re.search(r"orange|apple", "I like both apples and oranges")

<re.Match object; span=(12, 17), match='apple'>

In [19]:
re.findall(r"this|that", "this that and the other")

['this', 'that']

In [20]:
# has a vowel, anywhere
re.search(r"a|e|i|o|u", "banana", re.IGNORECASE)

<re.Match object; span=(1, 2), match='a'>

In [21]:
re.findall(r"a|e|i|o|u", "banana", re.IGNORECASE)

['a', 'a', 'a']
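
One detail worth knowing about `|`: it splits the whole pattern, not just the characters next to it. The sketch below uses a made-up sentence and a non-capturing group `(?:...)` to limit the OR to a single letter.

```python
import re

sentence = "gray grey griy"  # hypothetical sample text

# Without a group, the alternatives are "gra" OR "ey".
ungrouped = re.findall(r"gra|ey", sentence)

# With a group, only the middle letter varies: "gr" + ("a" OR "e") + "y".
grouped = re.findall(r"gr(?:a|e)y", sentence)

print(ungrouped)  # ['gra', 'ey']
print(grouped)    # ['gray', 'grey']
```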

In [23]:
# caret (^) is starts-with
# . is any character
# * is zero or more
re.search(r"^b.*", "bananarama")

<re.Match object; span=(0, 10), match='bananarama'>

In [24]:
# .* finds the largest possible match
# technical term is greedy
re.search(r"^b.*", "bananarama pajama")

<re.Match object; span=(0, 17), match='bananarama pajama'>
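
For contrast with greedy matching, Python's `re` also supports a lazy form, `*?`, which matches as few characters as possible. It is not otherwise covered in these notes and is shown here only to illustrate what "greedy" means.

```python
import re

text = "hello bananarama pajama"

greedy = re.search(r"b.*a", text).group()  # as many characters as possible before a final "a"
lazy = re.search(r"b.*?a", text).group()   # as few characters as possible before an "a"

print(greedy)  # bananarama pajama
print(lazy)    # ba
```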

In [28]:
# match the character b then any other character
re.search(r"b.", "hello bananarama pajama")

<re.Match object; span=(6, 8), match='ba'>

In [31]:
# match b followed by zero or more of any character (greedy, so it runs to the end)
re.search(r"b.*", "hello bananarama pajama")

<re.Match object; span=(6, 23), match='bananarama pajama'>

In [33]:
re.search(r"b.* ", "hello bananarama pajama")

<re.Match object; span=(6, 17), match='bananarama '>

In [35]:
re.search(r"[^b]", "hello bananarama pajama")

<re.Match object; span=(0, 1), match='h'>

In [36]:
# let's find something that starts with a then has any number of other characters
re.search(r"^a.*", "hello bananarama pajama")

In [37]:
re.search(r"ban.*", "hello bananarama pajama")

<re.Match object; span=(6, 23), match='bananarama pajama'>

In [38]:
re.search(r"a.*", "hello bananarama pajama")

<re.Match object; span=(7, 23), match='ananarama pajama'>

In [41]:
# starts with
# anything
# ends with 
re.search(r"^b.*rama", "bananarama pajama")

<re.Match object; span=(0, 10), match='bananarama'>

In [45]:
re.search(r".*jama$", "bananarama pajama")

<re.Match object; span=(0, 17), match='bananarama pajama'>

In [46]:
re.search(r".*rama", "bananarama pajama")

<re.Match object; span=(0, 10), match='bananarama'>

In [47]:
# \w matches [a-zA-Z0-9_]
re.search(r"\w", "abc123")

<re.Match object; span=(0, 1), match='a'>

In [48]:
re.search(r"\w\w\w", "abc123")

<re.Match object; span=(0, 3), match='abc'>

In [51]:
re.search(r"\w\w\w\w\w\w", "abc123")

<re.Match object; span=(0, 6), match='abc123'>

In [53]:
# seven \w in a row need seven consecutive [a-zA-Z0-9_] characters; "abc123" has only six, so this returns None
re.search(r"\w\w\w\w\w\w\w", "abc123")
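
A quick sketch of what `\w` does and does not cover, using an invented sample string. The underscore counts as a word character; the hyphen and the space do not.

```python
import re

sample = "snake_case-name 42"  # hypothetical sample text

word_runs = re.findall(r"\w+", sample)  # runs of letters, digits, and underscores
non_word = re.findall(r"\W", sample)    # everything \w does not match

print(word_runs)  # ['snake_case', 'name', '42']
print(non_word)   # ['-', ' ']
```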

In [54]:
re.search(r"\w*", "abc123")

<re.Match object; span=(0, 6), match='abc123'>

In [55]:
# curly braces for repetition
re.search(r"\w{3}", "abc123")

<re.Match object; span=(0, 3), match='abc'>

In [57]:
re.search(r"\w{1,6}", "abc123")

<re.Match object; span=(0, 6), match='abc123'>

In [59]:
# {n,} matches n or more times
re.search(r"\w{1,}", "abc123 is the place to be")

<re.Match object; span=(0, 6), match='abc123'>

In [60]:
# {n,} matches n or more times
re.findall(r"\w{1,}", "abc123 is the place to be")

['abc123', 'is', 'the', 'place', 'to', 'be']

In [65]:
# {n,m} matches between n and m times
re.findall(r"\w{1,6}", "abc123 is the place to be banaramapajama")

['abc123', 'is', 'the', 'place', 'to', 'be', 'banara', 'mapaja', 'ma']

In [67]:
# {n,m} matches between n and m times
# the space after the 1-6 \w characters must also match
re.findall(r"\w{1,6} ", "abc123 is the place to be banaramapajama")

['abc123 ', 'is ', 'the ', 'place ', 'to ', 'be ']

In [66]:
# {n,m} matches between n and m times; re.search returns only the first match
re.search(r"\w{1,6}", "abc123 is the place to be banaramapajama")

<re.Match object; span=(0, 6), match='abc123'>

In [63]:
# r"\w+" is the same as r"\w{1,}"
re.findall(r"\w+", "abc123 is the place to be")

['abc123', 'is', 'the', 'place', 'to', 'be']

In [70]:
# 3 digits, then any single character, then 4 digits
re.search(r"[0-9]{3}.[0-9]{4}", "226-3232")

<re.Match object; span=(0, 8), match='226-3232'>

In [71]:
# 3 digits, then any single character, then 4 digits
re.search(r"[0-9]{3}.[0-9]{4}", "226.3232")

<re.Match object; span=(0, 8), match='226.3232'>

In [73]:
# What if the delimiter is optional?
# question mark metacharacter means the thing to the left of the ? is optional
re.search(r"[0-9]{3}.?[0-9]{4}", "2263232")


<re.Match object; span=(0, 7), match='2263232'>

In [76]:
re.search(r"[0-9]{3}.?[0-9]{4}", "226-3232")


<re.Match object; span=(0, 8), match='226-3232'>
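
Because `.` matches any character, the pattern above also accepts delimiters we may not want. One possible tightening, sketched below, swaps `.` for a character class that lists the allowed delimiters. The names `loose` and `strict` and the sample inputs are made up for illustration.

```python
import re

loose = r"[0-9]{3}.?[0-9]{4}"      # any single character may sit between the digit groups
strict = r"[0-9]{3}[-.]?[0-9]{4}"  # only an optional hyphen or period is allowed

loose_hit = re.search(loose, "226x3232")
strict_hit = re.search(strict, "226x3232")
strict_ok = [bool(re.search(strict, s)) for s in ["226-3232", "226.3232", "2263232"]]

print(loose_hit.group())  # 226x3232
print(strict_hit)         # None
print(strict_ok)          # [True, True, True]
```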

## Using a RegEx pattern to split a string
- The `re.split` method returns a list of strings
- The matching substring is removed
- We can split on any regex pattern, not only character literals

In [None]:
# Split the phone number on the hyphen
re.split(r"-", "210-226-3232")

In [None]:
# Splits the string on the space character
# No backslash is needed: a space is a literal character
re.split(r" ", "this that and the other")
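
A short sketch of what these splits return, plus two splits on patterns rather than single literal characters. The extra sample strings are invented for illustration.

```python
import re

on_hyphen = re.split(r"-", "210-226-3232")
on_space = re.split(r" ", "this that and the other")

# A character class with + splits on any run of hyphens, periods, or whitespace.
mixed = re.split(r"[-.\s]+", "210.226 - 3232")

# \s+ treats any run of spaces or tabs as a single separator.
on_whitespace = re.split(r"\s+", "this  that\tand the other")

print(on_hyphen)      # ['210', '226', '3232']
print(on_space)       # ['this', 'that', 'and', 'the', 'other']
print(mixed)          # ['210', '226', '3232']
print(on_whitespace)  # ['this', 'that', 'and', 'the', 'other']
```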

In [None]:
# Parse these songs into a dataframe containing 2 columns: artist_name and song_name
# Hint: break the string into an array of strings that hold each song/artist record
songs = "Harry_Belafonte_-_Jump_In_the_Line.mp3,Willie_Mae_'Big_Mama'_Thornton_-_Hound_Dog.mp3,Tina_Turner_-_Proud_Mary.mp3,Prince_-_Purple_Rain.mp3"
songs
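
One possible approach to the exercise, sketched with `re.split` and `re.sub` only. It builds a list of dictionaries rather than a dataframe; passing that list to a dataframe constructor is left as the final step. The name `records` is an assumption, not part of the exercise.

```python
import re

songs = "Harry_Belafonte_-_Jump_In_the_Line.mp3,Willie_Mae_'Big_Mama'_Thornton_-_Hound_Dog.mp3,Tina_Turner_-_Proud_Mary.mp3,Prince_-_Purple_Rain.mp3"

records = []
for entry in re.split(r",", songs):
    entry = re.sub(r"\.mp3$", "", entry)    # drop the file extension
    artist, song = re.split(r"_-_", entry)  # artist and title are separated by "_-_"
    records.append({
        "artist_name": re.sub(r"_", " ", artist),  # underscores stand in for spaces
        "song_name": re.sub(r"_", " ", song),
    })

for record in records:
    print(record)
```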

## [Character Classes]
- Square brackets make character classes 
- Character classes provide OR behavior
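- In a character class, `^` works as a "None of" operator when it is the first character
- Most metacharacters match their literal character when inside of square brackets for a character class (`\`, `-` between two characters, and `]` keep their special roles)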
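
The bullets above can be sketched as follows. The sample strings and variable names are invented for illustration.

```python
import re

# A leading ^ negates the class: everything that is not a lowercase vowel.
consonants = re.findall(r"[^aeiou]", "banana")

# . and + are ordinary characters inside square brackets.
literals = re.findall(r"[.+]", "1+1=2.0")

# A ^ that is not in first position is also an ordinary character.
caret = re.findall(r"[a^]", "a^b")

# A - between two characters makes a range; at the end of the class it is literal.
in_range = re.findall(r"[a-c]", "abcd-")
with_hyphen = re.findall(r"[ac-]", "abcd-")

print(consonants)   # ['b', 'n', 'n']
print(literals)     # ['+', '.']
print(caret)        # ['a', '^']
print(in_range)     # ['a', 'b', 'c']
print(with_hyphen)  # ['a', 'c', '-']
```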

In [None]:
# has a vowel, anywhere

re.search(r"[aeiou]", "banana", re.IGNORECASE)

In [None]:
# The square brackets around a and e match either letter, so both spellings are found
re.findall(r"gr[ae]y", "Some people spell gray like grey")

In [None]:
# has a vowel, anywhere

re.search(r"a|e|i|o|u", "banana", re.IGNORECASE)

In [None]:
# Is a vowel

assert bool(re.search(r"^[aeiou]{1}$", "a", re.IGNORECASE)) == True
assert bool(re.search(r"^[aeiou]{1}$", "aaaa", re.IGNORECASE)) == False

In [None]:
# is only vowels

re.search(r"^[aeiou]*$", "aaeeeaa")

In [None]:
# has a p or q, anywhere
re.search(r"p|q", "albuquerque", re.IGNORECASE)

In [None]:
# has a p or q, anywhere
re.search(r"[pq]", "albuquerque", re.IGNORECASE)

In [None]:
# is p or q
re.search(r"^[pq]{1}$", "q", re.IGNORECASE)

In [None]:
# is only Ps and Qs
assert bool(re.search(r"^[pqPQ]*$", "pqpqpqpPQQQQQQQQp")) == True
assert bool(re.search(r"^[pq]*$", "b3qwpeop")) == False

In [None]:
string = "Two households, both alike in dignity, In fair Verona, where we lay our scene, From ancient grudge break to new mutiny, Where civil blood makes civil hands unclean."
string

re.findall(r"civil\s.{5}", string)


## Repetition characters and Special Sequences
> Walk before you run

- `.` matches any single character (except a newline, by default)
- `*` means zero or more of the preceding pattern
- `+` means one or more of the preceding pattern
- `\d` matches any decimal digit. For ASCII text it is equivalent to `[0-9]`
- `\D` matches any non-digit character and is equivalent to `[^0-9]`.
- `\s` matches any whitespace character, such as a space, tab, carriage return, or newline
- `\w` matches any alphanumeric character or underscore. For ASCII text it is equivalent to `[0-9a-zA-Z_]`
- `\W` matches any character that is not alphanumeric or an underscore. Equivalent to `[^a-zA-Z0-9_]`
- `{n}` exactly n repetitions of the preceding pattern
- `{n,}` n or more repetitions
- `{n,m}` between n and m repetitions (write it without a space after the comma)
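
A minimal sketch of the repetition operators applied to invented sample strings. Note that each operator repeats the pattern just before it. The `\b` word boundaries keep a quantifier from matching part of a longer number.

```python
import re

letters = "a ab abbb"
numbers = "7 42 365 2024 90210"

star = re.findall(r"ab*", letters)  # "a" followed by zero or more "b"
plus = re.findall(r"ab+", letters)  # "a" followed by one or more "b"

exactly_three = re.findall(r"\b\d{3}\b", numbers)
three_or_more = re.findall(r"\b\d{3,}\b", numbers)
two_to_four = re.findall(r"\b\d{2,4}\b", numbers)

print(star)           # ['a', 'ab', 'abbb']
print(plus)           # ['ab', 'abbb']
print(exactly_three)  # ['365']
print(three_or_more)  # ['365', '2024', '90210']
print(two_to_four)    # ['42', '365', '2024']
```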

## Groups