# Meet Regular Expressions

## What?
- Regular expressions (also called regexes or regex patterns) are a tiny language for describing and working with text and character patterns.
- With RegEx patterns we can:
    - Check whether a string matches a pattern
    - Check whether a pattern matches anywhere in a string
    - Modify and split strings in various ways
    
Python `re` library functions:
- `re.search` scans through a string, looking for the first location where the RE matches.
- `re.findall` finds all non-overlapping substrings where the RE matches and returns them as a list.
- `re.split` splits a string on a given regex pattern, removing that pattern. The result is a list of strings.
- `re.sub` allows us to match a regex and substitute in a new substring for the match.
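
As a quick illustration of the four functions above, here is a minimal sketch. The sample text and variable names are made up for this example.

```python
import re

# Hypothetical sample text, made up for this illustration
text = "cat, bat, and rat sat on the mat"

# re.search returns a Match object for the first match, or None if there is none
first = re.search(r"bat", text)

# re.findall returns a list of every non-overlapping match
all_at_words = re.findall(r"[a-z]at", text)

# re.split removes the matched separator and returns the remaining pieces
pieces = re.split(r", ", "cat, bat, rat")

# re.sub replaces every match with the replacement string
replaced = re.sub(r"[a-z]at", "dog", "cat and bat")

print(first.span(), all_at_words, pieces, replaced)
```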


## So What?
- Power + precision
    - Cost is learning something new and potentially unfamiliar.
    - Payoff is a language that works alongside most other programming languages to operate on text and character patterns.
- Regular Expressions are cross platform and available in many programming languages and environments:
    - Command line tools (Linux, Windows, Mac, etc...)
    - Python
    - SQL flavors offer RegEx
    - Java (Scala/Clojure)
    - Other languages like Julia, Ruby, PHP, C#, etc...
    - Like SQL, there are differences between some of the different RegEx implementations, but if you know your RegEx, you can bring value in many environments.

## Now What?
- We'll start simple by writing regex patterns to match literal characters.
- Then we will introduce metacharacters, which have special meaning and functionality.

## Key Concepts
- The RegEx metacharacters `. ^ $ * + ? { } [ ] \ | ( )` have special meanings. 
- Square brackets create a "character class". 
    - Character classes allow us to specify many OR operations
    - For example, `r"[aeiou]"` matches any lowercase vowel character. Identical to `r"a|e|i|o|u"`
    - `r"[a-z]"` matches lowercase a through z.
- Most metacharacters lose their special meaning inside of the character class square brackets `[]`. The exceptions are `^` (negation, when it comes first), `-` (ranges), `]`, and `\`.
- Outside of the character class `[]`, if you need to match a metacharacter literally, you will need to put a `\` in front of that character. `r"\+"` will match the literal `+` character.
- RegEx has characters for special sequences:
    - `\d` matches any decimal digit. For ASCII text it is equivalent to `[0-9]`
    - `\D` matches any non-digit character and is equivalent to `[^0-9]`.
    - `\s` matches any whitespace character, such as a space, tab, carriage return, or newline
    - `\w` matches any alphanumeric character or underscore. For ASCII text it is equivalent to `[0-9a-zA-Z_]`
    - `\W` matches any character that is not alphanumeric or an underscore. For ASCII text it is equivalent to `[^a-zA-Z0-9_]`
- `.` matches any character except a newline (by default)
- Repetition:
    - `*` matches zero or more of the previous pattern
    - `+` matches 1 or more of the previous pattern
- `?` after a pattern means that pattern is optional
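- Not: `^` as the first character inside a character class negates it, so `[^0-9]` matches any non-digit character
- Anchors: `^` matches at the start of the string and `$` matches at the end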
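
The concepts above can be sketched in a few lines. The sample strings and variable names here are invented for illustration only.

```python
import re

# \d+ matches one or more digits in a row
digits = re.findall(r"\d+", "Room 101, floor 3")

# \w+ matches runs of letters, digits, and underscores
words = re.findall(r"\w+", "snake_case and CamelCase!")

# \s+ matches any run of whitespace, so tabs and repeated spaces both split
tokens = re.split(r"\s+", "tabs\tand   spaces")

# ? makes the preceding character optional
colors = re.findall(r"colou?r", "color or colour")

# * allows zero or more of the preceding character
gs = re.findall(r"go*gle", "ggle gogle google")

# A backslash makes a metacharacter literal
plus_signs = re.findall(r"\+", "1+1=2")

# ^ and $ anchor the pattern to the start and end of the string
starts_with_two = bool(re.search(r"^Two", "Two households"))
ends_with_two = bool(re.search(r"Two$", "Two households"))

print(digits, words, tokens, colors, gs, plus_signs, starts_with_two, ends_with_two)
```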

## How Deep Does RegEx go?
- For strings that are challenging to match, like email addresses, we recommend using pre-built RegEx specifications such as the one in the HTML specification at https://html.spec.whatwg.org/multipage/forms.html#valid-e-mail-address
- With known, good, and proven RegEx patterns like these, you don't need to reinvent things.
- ```r"^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$"```
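
Here is a minimal sketch of how a proven pattern like this might be used. The helper name `looks_like_email` and the sample addresses are hypothetical. The sketch uses `fullmatch` so that the whole string must match the pattern.

```python
import re

# Pattern copied from the HTML specification's definition of a valid e-mail address
EMAIL_PATTERN = re.compile(
    r"^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$"
)

def looks_like_email(candidate):
    """Hypothetical helper: True when the whole string matches the pattern."""
    return EMAIL_PATTERN.fullmatch(candidate) is not None

print(looks_like_email("romeo@verona.example"))
print(looks_like_email("not an email"))
```

Matching the pattern only says the string is shaped like an email address. It does not confirm that the address exists.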

In [1]:
import re

### Patterns to Match Literals 
> Crawl before you walk

In [2]:
string = "Two households, both alike in dignity, In fair Verona, where we lay our scene, From ancient grudge break to new mutiny, Where civil blood makes civil hands unclean."
string

'Two households, both alike in dignity, In fair Verona, where we lay our scene, From ancient grudge break to new mutiny, Where civil blood makes civil hands unclean.'

In [3]:
# We can search for a literal match of the string Verona
x = re.search(r"Verona", string)

In [4]:
# The span of the match gives its start and end indexes.
# Consider what happens if we slice the string using the span bounds
string[47:53]

'Verona'
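
Rather than typing the indexes by hand, we can read them from the Match object. This is a small sketch. The variable names are made up here, and `line` is a shortened copy of the string above.

```python
import re

# Shortened copy of the example string, stored under a different name
line = "Two households, both alike in dignity, In fair Verona, where we lay our scene"

match = re.search(r"Verona", line)

# .span() returns the (start, end) indexes of the match
start, end = match.span()

# .group() returns the matched text itself
found = match.group()

# Slicing with the span bounds gives back the same text
sliced = line[start:end]

print(start, end, found, sliced)
```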

In [5]:
re.search(r"In fair Verona", string)

<re.Match object; span=(39, 53), match='In fair Verona'>

In [6]:
# The string "Leonardo DiCaprio" is not here, so re.search returns None
re.search(r"Leonardo DiCaprio", string)

In [7]:
# re.search returns the first match
re.search(r"civil", string)

<re.Match object; span=(126, 131), match='civil'>

In [8]:
# .findall returns all matches
re.findall(r"civil", string)

['civil', 'civil']

In [9]:
# .findall returns an empty list when there are no matches
re.findall(r"Claire Danes", string)

[]

In [10]:
re.search(r"Two", string)

<re.Match object; span=(0, 3), match='Two'>

In [11]:
# Are computers particular about specifics? Yes: matching is case-sensitive by default
re.search(r"two", string)

In [12]:
# The re.IGNORECASE flag does exactly that
re.search(r"two", string, re.IGNORECASE)

<re.Match object; span=(0, 3), match='Two'>

## Using `|` for a logical OR to open opportunities
- We can use `|` with literal characters or other regular expression patterns

In [13]:
# OR
# Findall returns all matches 
re.findall(r"gray|grey", "I can't remember if you spell grey gray or gray like grey!")

['grey', 'gray', 'gray', 'grey']

In [14]:
# The .search method returns only the first match
re.search(r"orange|apple", "I like both apples and oranges")

<re.Match object; span=(12, 17), match='apple'>

In [15]:
re.findall(r"this|that", "this that and the other")

['this', 'that']

In [16]:
# has a vowel, anywhere
re.search(r"a|e|i|o|u", "banana", re.IGNORECASE)

<re.Match object; span=(1, 2), match='a'>

## Using a RegEx pattern to split a string
- The `re.split` method returns a list of strings
- The matching substring is removed
- We can split on any regex pattern, not only character literals

In [17]:
# Split the phone number on the hyphen
re.split(r"-", "210-226-3232")

['210', '226', '3232']

In [18]:
# Splits the string on the space character
# A literal space needs no escaping in the pattern
re.split(r" ", "this that and the other")

['this', 'that', 'and', 'the', 'other']
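
Splitting on a pattern, not only on a literal, might look like the sketch below. The sample strings are invented for illustration.

```python
import re

# Split on a comma or semicolon, swallowing any whitespace around it
messy = "apples, oranges;bananas ,  kiwis"
fruits = re.split(r"\s*[,;]\s*", messy)

# Split a phone number on a hyphen, period, or space
parts = re.split(r"[-. ]", "210.226 3232")

# Splitting on runs of non-digits: a separator at the very start
# of the string leaves an empty string as the first item
with_empty = re.split(r"\D+", "(210) 226-3232")

print(fruits, parts, with_empty)
```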

In [19]:
# Parse these songs into a dataframe containing 2 columns: artist_name and song_name
# Hint: break the string into an array of strings that hold each song/artist record
songs = "Harry_Belafonte_-_Jump_In_the_Line.mp3,Willie_Mae_'Big_Mama'_Thornton_-_Hound_Dog.mp3,Tina_Turner_-_Proud_Mary.mp3,Prince_-_Purple_Rain.mp3"
songs

"Harry_Belafonte_-_Jump_In_the_Line.mp3,Willie_Mae_'Big_Mama'_Thornton_-_Hound_Dog.mp3,Tina_Turner_-_Proud_Mary.mp3,Prince_-_Purple_Rain.mp3"

## [Character Classes]
- Square brackets make character classes 
- Character classes provide OR behavior
- In a character class, `^` as the first character works as a "none of" operator
- Most metacharacters match their literal character when inside of square brackets for a character class
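
A short sketch of these character class rules, using made-up sample strings:

```python
import re

# ^ as the first character means "none of these"
consonants = re.findall(r"[^aeiou]", "banana")

# Inside a class, . and + match their literal characters
symbols = re.findall(r"[.+]", "1.5 + 2.5")

# A hyphen between two characters defines a range
first_three = re.findall(r"[a-c]", "abcdef")

# ^ anywhere other than first position is a literal caret
carets = re.findall(r"[a^]", "a^b")

print(consonants, symbols, first_three, carets)
```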

In [20]:
# has a vowel, anywhere

re.search(r"[aeiou]", "banana", re.IGNORECASE)

<re.Match object; span=(1, 2), match='a'>

In [21]:
# The square brackets match either a or e in that position
re.findall(r"gr[ae]y", "Some people spell gray like grey")

['gray', 'grey']

In [22]:
# has a vowel, anywhere (same result using | instead of a character class)

re.search(r"a|e|i|o|u", "banana", re.IGNORECASE)

<re.Match object; span=(1, 2), match='a'>

In [23]:
# Is a vowel

assert bool(re.search(r"^[aeiou]{1}$", "a", re.IGNORECASE)) == True
assert bool(re.search(r"^[aeiou]{1}$", "aaaa", re.IGNORECASE)) == False

In [24]:
# is only vowels

re.search(r"^[aeiou]*$", "aaeeeaa")

<re.Match object; span=(0, 7), match='aaeeeaa'>

In [25]:
# has a p or q, anywhere
re.search(r"p|q", "albuquerque", re.IGNORECASE)

<re.Match object; span=(4, 5), match='q'>

In [26]:
# has a p or q, anywhere
re.search(r"[pq]", "albuquerque", re.IGNORECASE)

<re.Match object; span=(4, 5), match='q'>

In [27]:
# is p or q
re.search(r"^[pq]{1}$", "q", re.IGNORECASE)

<re.Match object; span=(0, 1), match='q'>

In [28]:
# is only Ps and Qs
assert bool(re.search(r"^[pqPQ]*$", "pqpqpqpPQQQQQQQQp")) == True
assert bool(re.search(r"^[pq]*$", "b3qwpeop")) == False

In [29]:
string = "Two households, both alike in dignity, In fair Verona, where we lay our scene, From ancient grudge break to new mutiny, Where civil blood makes civil hands unclean."
string

# "civil", then one whitespace character, then any five characters
re.findall(r"civil\s.{5}", string)


['civil blood', 'civil hands']

## Repetition characters and Special Sequences
> Walk before you run