### Regular Expressions

>**Regular expressions** (called REs, or regexes, or regex patterns) are essentially **a tiny, highly specialized programming language embedded inside Python** and made available through the `re` module. [SOURCE](https://docs.python.org/3/howto/regex.html)

#### Matching exercise

In [1]:
# Import the re module; we'll use this throughout the notebook
import re

In [2]:
text = "Call my phone at 646-666-1234!"

In [3]:
# check to see if phone is in text 
pattern = 'phone'

In [4]:
# yes, there is a match 
re.search(pattern,text)

<re.Match object; span=(8, 13), match='phone'>

In [5]:
pattern = "python"

In [6]:
# no match: re.search returns None, so the notebook displays nothing
re.search(pattern,text)

In [7]:
# reset to phone 
pattern = 'phone'

match = re.search(pattern,text)

In [8]:
match 

<re.Match object; span=(8, 13), match='phone'>

In [9]:
# span() gives the start and end indices of the match

match.span()

(8, 13)

In [10]:
# what if phone appears more than once? 
text = "Call my phone at 646-666-1234! That is my phone number"
match = re.search("phone",text)

In [11]:
# it will only match the first instance 
match.span()

(8, 13)

In [12]:
# to get the matched text itself, use the group() method
match.group()

'phone'

In [13]:
# you need to use the findall() method to get all the matches 
text = "Call my phone at 646-666-1234! That is my phone number"
matches = re.findall("phone",text)
matches 

['phone', 'phone']
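
When the position of every occurrence matters, and not just the matched text, `re.finditer` yields one match object per hit. A minimal sketch using the same sentence (the variable name `spans` is only illustrative):

```python
import re

text = "Call my phone at 646-666-1234! That is my phone number"

# finditer yields a match object for every non-overlapping occurrence
spans = [m.span() for m in re.finditer("phone", text)]
print(spans)
```

Each tuple holds the start and end index of one occurrence, in the same form that `match.span()` returns for a single match.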

#### Regex Patterns 

In [14]:
# recall \d is for any digit 
text = "Call my phone at 646-666-1234!"

In [15]:
phone = re.search(r'\d\d\d-\d\d\d-\d\d\d\d',text)

# can also be written as the following 
# the {} represents number of times it repeats 

phone = re.search(r'\d{3}-\d{3}-\d{4}',text)

In [17]:
# group() with no argument returns the entire match
# wrapping parts of the pattern in parentheses creates groups,
# which you can isolate with group(n); numbering starts at 1
phone.group()

'646-666-1234'
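
To isolate one part of the phone number, the pattern needs parentheses around each part. A small sketch, assuming we want the three blocks of digits separately (the names `whole`, `area_code`, and `parts` are illustrative):

```python
import re

text = "Call my phone at 646-666-1234!"

# Each pair of parentheses creates a numbered capture group
phone = re.search(r'(\d{3})-(\d{3})-(\d{4})', text)

whole = phone.group()        # the entire match (same as group(0))
area_code = phone.group(1)   # the first parenthesized group
parts = phone.groups()       # all groups as a tuple
print(whole, area_code, parts)
```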

In [18]:
# use | for or operator 

re.search(r"cell|phone","Call my phone at 646-666-1234!")

<re.Match object; span=(8, 13), match='phone'>

In [19]:
re.search(r"cell|phone","Call my cell at 646-666-1234!")

<re.Match object; span=(8, 12), match='cell'>

In [20]:
# use . for wildcard - match any character 

In [21]:
# notice that ' am' still counts: a space is a character too
# matches any single character followed by 'am'
re.findall(r".am","Sam I am, green eggs and ham")

['Sam', ' am', 'ham']

In [22]:
# add more . for more wildcards
# matches any two characters followed by 'am'
re.findall(r"..am","Sam I am, green eggs and ham")

['I am', ' ham']

In [23]:
# what if I want to grab one or more non-whitespace characters followed by 'am'?
# use \S for non-whitespace
# use + for 1 or more times
re.findall(r'\S+am',"Sam I am, green eggs and ham")

['Sam', 'ham']

In [24]:
# use $ for "ends with"
# ends with a digit
re.findall(r'\d$','My cell is 646-666-1234')

['4']

In [25]:
# use ^ for "starts with"
# starts with a digit
re.findall(r'^\d','646-666-1234')

['6']

In [26]:
# use square brackets to define a set of characters
# use \w for alphanumeric characters (letters, digits, and underscore)
# use + for one or more occurrences

In [27]:
re.findall(r'[\w]+-[\w]+-[\w]+','My cell is 646-666-1234')

['646-666-1234']

In [28]:
# to exclude characters, put ^ at the start of the []
# any character listed after the ^ will be excluded

In [29]:
# exclude digits

text = 'My cell is 646-666-1234'
re.findall(r'[^\d]', text)

['M', 'y', ' ', 'c', 'e', 'l', 'l', ' ', 'i', 's', ' ', '-', '-']

In [30]:
# to keep runs of characters together, add the + sign
re.findall(r'[^\d]+',text)

['My cell is ', '-', '-']

In [31]:
# use case: removing punctuation

sentence = 'What a load of $#*!. That is weird, right!?'

In [32]:
# list of punctuation characters found online, placed inside [^ ] to exclude them
# + grabs each run of one or more non-punctuation characters
re.findall('[^!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~]+', sentence)

['What a load of ', ' That is weird', ' right']

In [33]:
# join the pieces back together (each piece keeps its own spaces)
clean_sentence = ''.join(re.findall('[^!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~]+', sentence))

In [34]:
clean_sentence

'What a load of  That is weird right'
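
An alternative sketch for the same clean-up: rather than collecting the non-punctuation runs and joining them, `re.sub` can delete the punctuation directly. This assumes the standard library's `string.punctuation` is an acceptable character list; `re.escape` makes those characters safe to place inside a character class. The name `punctuation_class` is illustrative.

```python
import re
import string

sentence = 'What a load of $#*!. That is weird, right!?'

# Build a character class from string.punctuation; re.escape protects
# characters such as ], \ and - that are special inside [...]
punctuation_class = "[" + re.escape(string.punctuation) + "]"

# Delete every punctuation character
no_punct = re.sub(punctuation_class, "", sentence)

# Collapse leftover runs of whitespace into single spaces
clean_sentence = re.sub(r"\s+", " ", no_punct).strip()
print(clean_sentence)
```

The second substitution is optional; it only tidies the double spaces left behind where punctuation sat between two spaces.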

In [35]:
# use () and | for multiple options
# keep $ outside the parentheses so it applies to every option, not just 'org'
email = 'avengers@marvel.com'
email_one = "loki@school.edu"
email_two = "thor@institute.org"

In [36]:
re.search(r'\.(com|edu|org)$',email)

<re.Match object; span=(15, 19), match='.com'>

In [37]:
re.search(r'\.(com|edu|org)$',email_one)

<re.Match object; span=(11, 15), match='.edu'>

In [38]:
re.search(r'\.(com|edu|org)$',email_two)

<re.Match object; span=(14, 18), match='.org'>
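
Where the `$` sits matters: inside the parentheses it binds only to the last alternative, while outside it applies to the whole group. The sketch below illustrates the difference with a made-up address (`user@example.community` is hypothetical):

```python
import re

# A made-up address whose domain merely starts with ".com"
tricky = "user@example.community"

# $ inside the group binds only to "org", so ".com" may match anywhere
loose = re.search(r'\.(com|edu|org$)', tricky)

# $ outside the group applies to all three alternatives
strict = re.search(r'\.(com|edu|org)$', tricky)

print(loose, strict)
```

Here `loose` finds `.com` in the middle of the string, while `strict` is `None` because nothing in the list sits at the very end.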

#### An example

Let's say we want to create a function that validates newschool email addresses.

The function will print `match` if the address has the following form:<br>
`five letters` + `three digits` + `@newschool.edu`

Otherwise, it will print `no match`.

In [39]:
# Without regular expressions
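# list of letters and digits in string format
# note the comma after every item: without it, adjacent strings such as 'h' 'i' merge into 'hi'
letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h',
           'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p',
           'q', 'r', 's', 't', 'u', 'v', 'w', 'x',
           'y', 'z']

digits = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

def validateAddress(address):
    # check the length first so that short strings cannot raise an IndexError
    if (len(address) >= 8 and
        address[0] in letters and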
        address[1] in letters and
        address[2] in letters and
        address[3] in letters and
        address[4] in letters and
        address[5] in digits and 
        address[6] in digits and 
        address[7] in digits and 
        address[8:] == '@newschool.edu'
       ):
        print('match')
    else:
        print('no match')

validateAddress('lastf123@newschool.edu')

match


In [40]:
# With regular expressions
# match that starts with a letter (5 times), then a digit (3 times), followed by @newschool.edu, then the end
# [A-Za-z] is used because [A-z] would also match the symbols [ \ ] ^ _ `
def validateAddress(address):
    if (re.search(r"^[A-Za-z]{5}[0-9]{3}@newschool\.edu$", address)):
        print('match')
    else:
        print('no match')

validateAddress('lastf123@newschool.edu')

match
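
For use inside other code, a validator that returns a Boolean is often handier than one that prints. The sketch below is one way to write it; the names `NEWSCHOOL_PATTERN` and `is_newschool_address` are hypothetical, and it assumes both upper- and lowercase letters are acceptable.

```python
import re

# Compile once; fullmatch() requires the whole string to fit the pattern,
# so explicit ^ and $ anchors are not needed
NEWSCHOOL_PATTERN = re.compile(r"[A-Za-z]{5}[0-9]{3}@newschool\.edu")

def is_newschool_address(address):
    """Return True for five letters + three digits + @newschool.edu."""
    return NEWSCHOOL_PATTERN.fullmatch(address) is not None

print(is_newschool_address('lastf123@newschool.edu'))
print(is_newschool_address('last123@newschool.edu'))
```

The first call fits the pattern; the second has only four letters before the digits, so it is rejected.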


In [41]:
# Regular expressions become especially valuable when the pattern
# grows more flexible. For example, a simple generic
# email validator (it does not cover every valid address).
def validateAddress(address):
    if (re.search(r"^[A-Za-z0-9.\-]+@[A-Za-z]+\.[A-Za-z]+$", address)):
        print('match')
    else:
        print('no match')

validateAddress('lastf123@gmail.com')

match


<br>

### General notes on regular expressions

Regular expressions use `metacharacters`:

`. ^ $ * + ? { } [ ] \ | ( )`

These metacharacters have special meanings, which makes regular expressions far more flexible than plain string comparison.

In [41]:
# Square brackets -> Sets of characters

# Letters
pattern = r"[a-z]" # All lowercase letters
pattern = r"[A-Z]" # All uppercase letters
pattern = r"[A-Za-z]" # All letters (r"[A-z]" would also match [ \ ] ^ _ `)

# Digits
pattern = r"[0-9]" # All digits
pattern = r"[0-5]" # All digits from 0 to 5

# Custom set of characters
pattern = r"[A-Za-z0-9]" # All letters and digits
pattern = r"[AEIOUaeiou]" # All vowels

# NOT in set
pattern = r"[^aeiou]" # Any character that is NOT a lowercase vowel
pattern = r"[^0-9]" # Any character that is NOT a digit
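
One caveat about ranges is worth checking by hand: a range follows character codes, so `[A-z]` covers everything between `A` (65) and `z` (122), including six symbols that sit between the uppercase and lowercase letters. A short sketch (the sample string is made up):

```python
import re

sample = "a_Z^9"

# [A-z] spans character codes 65-122, which includes [ \ ] ^ _ and `
loose = re.findall(r"[A-z]", sample)

# [A-Za-z] lists the two letter ranges explicitly
strict = re.findall(r"[A-Za-z]", sample)

# A leading ^ inside the brackets negates the set
non_digits = re.findall(r"[^0-9]", sample)
print(loose, strict, non_digits)
```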

In [42]:
# Caret and dollar sign -> Beginning and end

# At the beginning
pattern = r"^[A-Za-z]" # A letter at the beginning
pattern = r"^[0-9]" # A digit at the beginning

# At the end
pattern = r"[A-Za-z]$" # A letter at the end
pattern = r"[0-9]$" # A digit at the end

# Defining both beginning and end
pattern = r"^[A-Za-z]$" # Exactly one letter
pattern = r"^[0-9]$" # Exactly one digit
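
The anchors can be tried out with `re.search`, which would otherwise scan the whole string for a hit. A minimal sketch with made-up test strings:

```python
import re

# ^ pins the match to the start, $ pins it to the end
starts_with_digit = bool(re.search(r"^[0-9]", "42 apples"))
ends_with_digit = bool(re.search(r"[0-9]$", "42 apples"))

# Anchoring both ends forces the whole string to fit the pattern
single_digit = bool(re.search(r"^[0-9]$", "7"))
two_digits = bool(re.search(r"^[0-9]$", "42"))
print(starts_with_digit, ends_with_digit, single_digit, two_digits)
```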

In [43]:
# Curly brackets, asterisk, plus sign, and question mark -> Repetition

# Curly brackets
pattern = r"[A-Za-z]{3}" # Exactly three letters
pattern = r"[A-Za-z]{3,5}" # Between three and five letters

# Asterisk
pattern = r"[A-Za-z]*" # Zero or more letters

# Plus sign
pattern = r"[A-Za-z]+" # One or more letters

# Question mark
pattern = r"[A-Za-z]?" # Zero or one letter
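
The quantifiers are easy to mix up, so a quick check of each one can help. The sketch below uses made-up inputs; `re.fullmatch` is used where the question is whether an entire string fits the pattern.

```python
import re

# {3,5}: between three and five letters; matching is greedy,
# so each match takes as many letters as it can (up to five)
runs = re.findall(r"[a-z]{3,5}", "ab abc abcdefg")

# ?: zero or one of the preceding item
optional_u = [bool(re.fullmatch(r"colou?r", w))
              for w in ("color", "colour", "colouur")]

# *: zero or more, so even the empty string fits
star_empty = bool(re.fullmatch(r"[a-z]*", ""))

# +: one or more, so the empty string does not fit
plus_empty = bool(re.fullmatch(r"[a-z]+", ""))
print(runs, optional_u, star_empty, plus_empty)
```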

In [44]:
# Dot - Any character
pattern = r"." # Any character except a newline
pattern = r".{3}" # Three of any character
pattern = r"[A-Za-z].?" # Any letter, optionally followed by one more character
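
By default the dot does not match a newline; passing the `re.DOTALL` flag changes that. A short sketch with a made-up two-line string:

```python
import re

sample = "a\nb"

# By default the dot skips newline characters
default_dot = re.findall(r".", sample)

# With the re.DOTALL flag it matches them as well
dotall_dot = re.findall(r".", sample, flags=re.DOTALL)
print(default_dot, dotall_dot)
```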

In [45]:
# Backslash - Escape character
pattern = r"\." # A literal period symbol
pattern = r"[A-Za-z]?\?" # Zero or one letter followed by a literal question mark
pattern = r"\[\]" # A literal pair of square brackets
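
When the text to search for contains metacharacters, escaping them by hand works, and `re.escape` can add the backslashes automatically. A minimal sketch with a made-up sentence:

```python
import re

text = "Is 1+1 really 2?"

# Unescaped, + is a quantifier: "1+1" means one or more 1s followed by a 1
unescaped = re.search(r"1+1", text)

# Escaping by hand makes the plus sign literal
escaped = re.search(r"1\+1", text)

# re.escape() adds the backslashes automatically
auto_escaped = re.search(re.escape("1+1"), text)
print(unescaped, escaped.group(), auto_escaped.group())
```

The unescaped pattern finds nothing here because the text never contains two `1`s in a row.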

In [46]:
# Parentheses - Grouping
pattern = r"anna( banana)*" # 'anna' followed by zero or more repetitions of ' banana' (including its leading space)
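
Parentheses matter here because a quantifier applies only to the item immediately before it. The sketch below compares the grouped pattern with an ungrouped variant (the test strings are made up):

```python
import re

pattern = r"anna( banana)*"

# The group as a whole is repeated zero or more times
just_anna = bool(re.fullmatch(pattern, "anna"))
with_two = bool(re.fullmatch(pattern, "anna banana banana"))

# Without the parentheses, * applies only to the final 'a'
no_group = bool(re.fullmatch(r"anna banana*", "anna banana banana"))
print(just_anna, with_two, no_group)
```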

In [47]:
# Bar - Logical or
pattern = r"(anna|banana)" # Either 'anna' or 'banana'
pattern = r"\.(com|edu|org)$" # Ends in a literal dot followed by 'com', 'edu', or 'org'
pattern = r"([aeiou]|[02468])" # Either a lowercase vowel or an even digit
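
A quick way to see alternation at work is to run `re.findall` over short made-up strings. When the pattern contains a single group, `findall` returns the text captured by that group:

```python
import re

# Either word may appear; each hit is reported in order of appearance
fruit = re.findall(r"(anna|banana)", "anna ate a banana")

# A lowercase vowel or an even digit, one character at a time
hits = re.findall(r"([aeiou]|[02468])", "b4 u go 7")
print(fruit, hits)
```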