### Regular Expressions

>**Regular expressions** (called REs, or regexes, or regex patterns) are essentially **a tiny, highly specialized programming language embedded inside Python** and made available through the `re` module. [SOURCE](https://docs.python.org/3/howto/regex.html)

#### Matching exercise

In [1]:
# Import the re module; we'll use it throughout the notebook
import re

In [2]:
# check to see if phone is in text 


In [3]:
# yes, there is a match 


In [4]:
# not a match, return nothing 


In [5]:
# reset to phone 


In [6]:
# this is where the match spans: its start and end indices



In [7]:
# what if phone appears more than once? 


In [8]:
# it will only match the first instance 


In [9]:
# to get the matched text, use the group() method


In [10]:
# you need to use the findall() method to get all the matches 
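
The matching steps in this exercise can be sketched roughly as follows. The sample strings are made up for illustration; only `re.search`, the match object's `span()`, `start()`, `end()`, and `group()` methods, and `re.findall` come from the `re` module.

```python
import re

# Hypothetical sample text for illustration
text = "The agent's phone number is 408-555-1234. Call soon!"

# Plain membership test: is the substring present?
print("phone" in text)

# re.search returns a match object on success and None otherwise
match = re.search("phone", text)
no_match = re.search("NOT IN TEXT", text)
print(no_match)

# span(), start() and end() report where the match sits in the string
print(match.span(), match.start(), match.end())

# When the pattern occurs more than once, search() stops at the first hit
text_twice = "my phone once, my phone twice"
first = re.search("phone", text_twice)
print(first.span())
print(first.group())  # the matched text itself

# findall() returns every non-overlapping match as a list of strings
all_matches = re.findall("phone", text_twice)
print(len(all_matches))
```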


#### Regex Patterns 

In [11]:
# recall \d is for any digit 


In [12]:


# the same pattern can be written more compactly
# {n} repeats the preceding token exactly n times



In [13]:

# groups are delimited by parentheses, so you can use group(n) to isolate each one
# group numbering starts at 1; group(0) is the whole match
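
A minimal sketch of the digit pattern, the `{n}` shorthand, and group extraction. The phone number and sentence are hypothetical sample data.

```python
import re

# Hypothetical sample text for illustration
text = "My phone number is 408-555-7777"

# \d matches a single digit, so the number can be spelled out digit by digit
long_form = re.search(r"\d\d\d-\d\d\d-\d\d\d\d", text)

# {n} repeats the preceding token exactly n times
short_form = re.search(r"\d{3}-\d{3}-\d{4}", text)
print(long_form.group() == short_form.group())

# Parentheses create groups: group(0) is the whole match, group(1) the first group
grouped = re.search(r"(\d{3})-(\d{3})-(\d{4})", text)
print(grouped.group(1))
print(grouped.group(3))
```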

In [14]:
# use | for or operator 
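
As an illustration of the `|` operator, the sketch below searches three made-up sentences for either of two words.

```python
import re

# Hypothetical sample sentences for illustration
cat_hit = re.search(r"cat|dog", "The cat is here")
dog_hit = re.search(r"cat|dog", "The dog is here")
no_hit = re.search(r"cat|dog", "The bird is here")

print(cat_hit.group())
print(dog_hit.group())
print(no_hit)  # neither alternative is present, so search() returns None
```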



In [19]:
# use . as a wildcard: it matches any single character (except a newline)

In [15]:
# notice how _am still counts 


In [16]:
# add more . for more wildcards 


In [17]:
# what if I want to grab one or more non whitespace that ends with at 
# use \S for non whitespace 
# use + for 1 or more times
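
The wildcard behaviour described in these cells can be sketched as follows. The sample strings are hypothetical; note that a space counts as a character for `.`, whereas `\S` skips whitespace.

```python
import re

# One wildcard before "am": a leading space is matched too
one_wild = re.findall(r".am", "I am Sam")
print(one_wild)

# Each extra . consumes one more character
two_wild = re.findall(r"..am", "I am Sam")
print(two_wild)

# \S+ means one or more non-whitespace characters, so whole word prefixes are kept
sentence = "The cat in the hat went splat."
words_ending_in_at = re.findall(r"\S+at", sentence)
print(words_ending_in_at)
```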


In [23]:
# use ^ to match at the start of the string
# use $ to match at the end of the string

In [18]:
# start with a number  


In [19]:
# ends with a number  
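
A short sketch of the two anchors, using made-up sentences. `^\d` only matches a digit at the very start of the string, and `\d$` only a digit at the very end.

```python
import re

# Hypothetical sample sentences for illustration
starts_with_digit = re.findall(r"^\d", "1 is a number")
no_leading_digit = re.findall(r"^\d", "The number is 2")
ends_with_digit = re.findall(r"\d$", "The number is 2")

print(starts_with_digit)
print(no_leading_digit)
print(ends_with_digit)
```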


In [26]:
# use square brackets to define a set of characters
# use \w for a word character (letter, digit, or underscore)
# use + to match one or more occurrences

In [28]:
# to exclude characters, put ^ first inside the brackets: [^...]
# any character listed after the ^ will be excluded

In [20]:
# exclude numbers 



In [21]:
# to keep words together, add + after the set so runs of characters match as one piece
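
The exclusion steps above might look like the sketch below. The phrase is hypothetical sample data; `[^\d]` matches any single character that is not a digit, and adding `+` groups consecutive matches into longer pieces.

```python
import re

# Hypothetical sample text for illustration
phrase = "there are 3 numbers 34 inside 5 this sentence"

# Without +, every non-digit character is returned on its own
single_chars = re.findall(r"[^\d]", phrase)
print(single_chars[:5])

# With +, runs of non-digit characters stay together
chunks = re.findall(r"[^\d]+", phrase)
print(chunks)
```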


In [22]:
# use case: removing punctuation



In [23]:
# found a list of punctuation characters online: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [24]:
# join sentence 
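
One way to sketch the punctuation-removal use case. The sentence is made up, and the sketch takes the punctuation list from `string.punctuation` in the standard library, which is an assumption about where the list in the cell above came from. `re.escape` is used so that characters such as `]`, `\`, and `^` are treated literally inside the set.

```python
import re
import string

# Hypothetical sample sentence for illustration
sentence = "This is a string! But it has punctuation. How can we remove it?"

# Build a set that excludes every punctuation character and the space
pattern = "[^" + re.escape(string.punctuation) + " ]+"

# Each match is a run of characters with no punctuation or spaces, i.e. a word
words = re.findall(pattern, sentence)

# Join the words back into a sentence
clean = " ".join(words)
print(clean)
```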


In [25]:
# use () together with | to offer multiple options in one part of the pattern
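
A small sketch of parentheses combined with `|`, using a made-up text. One detail worth knowing: when a pattern contains a capturing group, `re.findall` returns the group contents rather than the whole match, so the sketch uses a non-capturing group `(?:...)` to get full words back.

```python
import re

# Hypothetical sample text for illustration
text = "Would you like some catfish? Or a catnap? Have you seen this caterpillar?"

# search() with a capturing group: group() is the whole match, group(1) the chosen option
first = re.search(r"cat(fish|nap|claw)", text)
print(first.group(), first.group(1))

# findall() with a capturing group returns only the captured part
captured_only = re.findall(r"cat(fish|nap|claw)", text)
print(captured_only)

# A non-capturing group keeps the whole match in the result
whole_words = re.findall(r"cat(?:fish|nap|claw)", text)
print(whole_words)
```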


#### An example

Let's say we want to create a function that validates newschool email addresses.

The function will return True if the address has the following features:<br>
`five letters` + `three digits` + `@newschool.edu`

Otherwise, it will return False.

In [26]:
# Without regular expressions


In [27]:
# With regular expressions
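
The two approaches could be sketched as below. The function names `is_valid_plain` and `is_valid_regex` are hypothetical, and the sketch assumes "letters" means ASCII letters and "digits" means 0 to 9.

```python
import re

SUFFIX = "@newschool.edu"

def is_valid_plain(address):
    """Validate with string methods only (hypothetical helper)."""
    if not address.endswith(SUFFIX):
        return False
    local = address[: -len(SUFFIX)]
    # Five letters followed by three digits, ASCII only
    return (
        len(local) == 8
        and local.isascii()
        and local[:5].isalpha()
        and local[5:].isdigit()
    )

# The dot is escaped so that it only matches a literal period
EMAIL_RE = re.compile(r"[A-Za-z]{5}[0-9]{3}@newschool\.edu")

def is_valid_regex(address):
    """Validate with a single pattern (hypothetical helper)."""
    # fullmatch requires the whole string to fit the pattern
    return EMAIL_RE.fullmatch(address) is not None

print(is_valid_plain("abcde123@newschool.edu"))
print(is_valid_regex("abcde123@newschool.edu"))
print(is_valid_regex("abcd1234@newschool.edu"))
```

The regex version states the whole rule in one line, which is easier to adjust when the format changes.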


In [28]:
# Regular expressions become very important when the pattern
# increases in flexibility. For example, creating a generic
# email validator.
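
As a rough illustration of a more flexible pattern, the sketch below accepts anything shaped like `name@domain.tld`. It is deliberately loose and is not a complete implementation of the email address rules; the name `looks_like_email` is hypothetical.

```python
import re

# One or more word characters, dots, pluses, or hyphens; then @;
# then a domain made of at least two dot-separated labels
GENERIC_EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")

def looks_like_email(address):
    """Loose shape check only (hypothetical helper)."""
    return GENERIC_EMAIL_RE.fullmatch(address) is not None

print(looks_like_email("jane.doe@example.com"))
print(looks_like_email("no-at-sign.com"))
```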


<br>

### General notes on regular expressions

Regular expressions use `metacharacters`:

`. ^ $ * + ? { } [ ] \ | ( )`

These metacharacters carry special meaning, which makes regular expressions far more flexible than plain string matching.

In [42]:
# Square brackets -> Sets of characters

# Letters
pattern = r"[a-z]" # All lowercase letters
pattern = r"[A-Z]" # All uppercase letters
pattern = r"[A-Za-z]" # All letters (r"[A-z]" would also match [ \ ] ^ _ and `)

# Digits
pattern = r"[0-9]" # All digits
pattern = r"[0-5]" # All digits from 0 to 5

# Custom set of characters
pattern = r"[A-Za-z0-9]" # All letters and digits
pattern = r"[AEIOUaeiou]" # All vowels

# NOT in set
pattern = r"[^aeiou]" # Any character that is NOT a lowercase vowel
pattern = r"[^0-9]" # Any character that is NOT a digit
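
A range inside square brackets follows character-code order, so `[A-z]` covers more than letters: the six ASCII characters between `Z` and `a` are included as well. The sketch below, on a made-up sample string, compares it with `[A-Za-z]` and also shows that a negated set is case-sensitive.

```python
import re

# Hypothetical sample string for illustration
sample = "A_z[9]"

# [A-z] also matches _ [ and ] because they sit between Z and a in ASCII
wide = re.findall(r"[A-z]", sample)
print(wide)

# [A-Za-z] matches letters only
strict = re.findall(r"[A-Za-z]", sample)
print(strict)

# [^aeiou] excludes lowercase vowels only, so the uppercase A still matches
not_lower_vowel = re.findall(r"[^aeiou]", "Audio")
print(not_lower_vowel)
```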

In [43]:
# Caret and dollar sign -> Beginning and end

# At the beginning
pattern = r"^[A-Za-z]" # A letter at the beginning
pattern = r"^[0-9]" # A digit at the beginning

# At the end
pattern = r"[A-Za-z]$" # A letter at the end
pattern = r"[0-9]$" # A digit at the end

# Defining both beginning and end
pattern = r"^[A-Za-z]$" # Exactly one letter
pattern = r"^[0-9]$" # Exactly one digit

In [44]:
# Curly brackets, asterisk, and question mark -> Repetition

# Curly brackets
pattern = r"[A-Za-z]{3}" # Exactly three letters
pattern = r"[A-Za-z]{3,5}" # Between three and five letters

# Asterisk
pattern = r"[A-Za-z]*" # Zero or more letters

# Question mark
pattern = r"[A-Za-z]?" # Zero or one letter (use + for one or more)
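
The repetition operators are easy to mix up, so here is a small sketch that checks each one with `re.fullmatch`, which requires the entire string to fit the pattern. The helper name `fits` is hypothetical.

```python
import re

def fits(pattern, s):
    """True when the whole string matches the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, s) is not None

# {3}: exactly three
print(fits(r"[A-Za-z]{3}", "abc"), fits(r"[A-Za-z]{3}", "abcd"))

# {3,5}: between three and five
print(fits(r"[A-Za-z]{3,5}", "abcd"), fits(r"[A-Za-z]{3,5}", "ab"))

# *: zero or more, so the empty string fits
print(fits(r"[A-Za-z]*", ""))

# ?: zero or one, so two letters do not fit
print(fits(r"[A-Za-z]?", "a"), fits(r"[A-Za-z]?", "ab"))

# +: one or more, so the empty string does not fit
print(fits(r"[A-Za-z]+", "ab"), fits(r"[A-Za-z]+", ""))
```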

In [45]:
# Dot - Any character (except a newline, by default)
pattern = r"." # Any character
pattern = r".{3}" # Three of any character
pattern = r"[A-Za-z].?" # Any letter optionally followed by one of any character

In [46]:
# Backslash - Escape character
pattern = r"\." # A literal period symbol
pattern = r"[A-Za-z]?\?" # An optional letter followed by a literal question mark
pattern = r"\[\]" # A literal pair of square brackets
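
To see the difference between a metacharacter and its escaped form, the sketch below runs both on a made-up string. It also uses `re.escape`, a standard `re` function that adds the backslashes automatically.

```python
import re

# Hypothetical sample string for illustration
sample = "a.b?"

# Unescaped dot: matches every character
any_char = re.findall(r".", sample)
print(any_char)

# Escaped dot and question mark: match only the literal symbols
literal_dot = re.findall(r"\.", sample)
literal_question = re.findall(r"\?", sample)
print(literal_dot, literal_question)

# re.escape builds a pattern that matches the given text literally
brackets = re.fullmatch(re.escape("[]"), "[]")
print(brackets is not None)
```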

In [47]:
# Parentheses - Grouping
pattern = r"anna( banana)*" # 'anna' followed by zero or more 'banana's

In [48]:
# Bar - Logical or
pattern = r"(anna|banana)" # Either 'anna' or 'banana'
pattern = r"\.(com|edu|org)$" # Ends in a literal dot followed by 'com', 'edu', or 'org'
pattern = r"([aeiou]|[02468])" # Either a lowercase vowel or an even digit
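
The position of `$` relative to the parentheses matters: inside the group it binds only to the last alternative, while outside the group it applies to all of them. A minimal sketch with made-up strings:

```python
import re

inside = r"\.(com|edu|org$)"   # $ applies to 'org' only
outside = r"\.(com|edu|org)$"  # $ applies to every alternative

# Hypothetical sample strings for illustration
url_like = "example.com/login"
domain = "newschool.edu"

# With $ inside the group, '.com' is accepted anywhere in the string
inside_hit = re.search(inside, url_like)
print(inside_hit is not None)

# With $ outside the group, the string has to end with the suffix
outside_miss = re.search(outside, url_like)
outside_hit = re.search(outside, domain)
print(outside_miss is None, outside_hit.group(1))
```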