# Regular Expression

Regular expressions, or regexes, are written in a condensed formatting language. In general, you can think of a regular expression as a pattern that you give to a regex processor along with some source data. The processor parses the source data using that pattern and returns chunks of text to the data scientist or programmer for further manipulation.

Reference: https://regex101.com/

In [None]:
# First we'll import the re module
import re

In [None]:
# The first, re.match(), checks for a match at the beginning of the string.
# It returns a Match object (truthy) on success and None (falsy) otherwise.

In [None]:
# The second, re.search(), checks for a match anywhere in the string.
# It returns a Match object if one is found and None otherwise.

# Let's create a text example
text = "Ann Arbor will get snowy days soon."

# Now let's see if Ann Arbor will get snowy days

if re.search("snowy", text): # The first parameter is the pattern
    print("Wonderful")
else:
    print("Oh no!")

Wonderful


In [None]:
if re.match("snowy", text):
    print("Wonderful")
else:
    print("Oh no!")

Oh no!
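
To see what the two functions actually hand back, the sketch below prints their return values directly instead of testing them in an if statement. It reuses the text string from the cells above; the variable names found and at_start are illustrative.

```python
import re

text = "Ann Arbor will get snowy days soon."

# re.search() scans the whole string and returns a Match object on success
found = re.search("snowy", text)
print(found)        # <re.Match object; span=(19, 24), match='snowy'>

# re.match() only tries position 0, so it returns None for this text
at_start = re.match("snowy", text)
print(at_start)     # None

# A Match object is truthy and None is falsy, which is why both work in an if
print(bool(found), bool(at_start))   # True False
```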


In [None]:
# Segment a string

# Tokenizing: a string is separated into substrings based on patterns
# re.findall() and re.split() parse the string for us and return chunks.

# Let's try
text2 = "Hey, Phoebe works diligently. Phoebe got a perfect presentation to the product team. Our teammate Phoebe is wonderful."

# Let's split this on all instances of Phoebe
re.split("Phoebe", text2)

['Hey, ',
 ' works diligently. ',
 ' got a perfect presentation to the product team. Our teammate ',
 ' is wonderful.']

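Notice that every Phoebe has disappeared from the result: re.split() removes the text that matched the pattern and returns only the chunks between matches. There is no empty string at the beginning here because the text starts with "Hey, "; if it started with Phoebe, the first chunk would be an empty string.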
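
The behaviour of re.split() at the edges of the string can be checked with a short sketch. The sentence starts_with_name is a made-up variant of text2 that begins with the separator, and the capturing-group form is shown as one way to keep the separators.

```python
import re

# Hypothetical variant of text2 that begins with the separator
starts_with_name = "Phoebe works diligently. Phoebe is wonderful."

# A leading separator produces an empty first chunk
chunks = re.split("Phoebe", starts_with_name)
print(chunks)   # ['', ' works diligently. ', ' is wonderful.']

# Wrapping the pattern in a capturing group keeps the separators in the result
kept = re.split("(Phoebe)", starts_with_name)
print(kept)     # ['', 'Phoebe', ' works diligently. ', 'Phoebe', ' is wonderful.']
```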

In [None]:
# Count how many times I talked about Phoebe!

re.findall("Phoebe", text2)

len(re.findall("Phoebe", text2))

3

### re.compile() method

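Syntax for re.compile() is
* re.compile(pattern, flags=0)

It returns a compiled pattern object. The text to search is not passed to re.compile(); it is passed to the methods of that object, such as findall().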
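
A minimal sketch of compiling a pattern once and reusing it. The re.IGNORECASE flag and the sample sentence are illustrative choices that do not appear in the original notebook.

```python
import re

# Compile once; the optional second argument is a flag, not the text to search
name_pattern = re.compile("phoebe", re.IGNORECASE)

sample = "Phoebe presented. Everyone thanked PHOEBE afterwards."

# The compiled object exposes the same methods as the re module
matches = name_pattern.findall(sample)
print(matches)                                     # ['Phoebe', 'PHOEBE']
print(bool(name_pattern.search("no name here")))   # False
```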

In [None]:
text2 = "Hey, Phoebe works diligently. Phoebe got a perfect presentation to the product team. Our teammate Phoebe is wonderful."
pattern = re.compile('Phoebe')
pattern.findall(text2)

['Phoebe', 'Phoebe', 'Phoebe']

In [None]:
pattern = re.compile('Derek')
pattern.findall('Hey Derek works diligently. Derek gets perfect presentation to the professor and a conference. Our student Derek is wonderful.')


['Derek', 'Derek', 'Derek']

## Anchor

Anchors specify the start and/or the end of the string that you are trying to match.
* ^ (caret character) matches at the start of the string (put the ^ before the pattern)
* \$ (dollar sign) matches at the end of the string (put the \$ after the pattern)

In [None]:
text2 = "Phoebe works diligently. Phoebe got a perfect presentation to the product team. Our teammate Phoebe is wonderful."
re.search("^Phoebe", text2)

# Notice that re.search() returned an object, called a re.Match object.
# If no match is found, it returns None.

<re.Match object; span=(0, 6), match='Phoebe'>
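
The sketch below pairs the ^ example with a $ example and reads the position and text out of the returned Match object. The text2 here is a shortened version of the string above, used only to keep the example small.

```python
import re

text2 = "Phoebe works diligently. Our teammate Phoebe is wonderful."

# ^ anchors the pattern to the start of the string
start = re.search("^Phoebe", text2)
print(start.span(), start.group())    # (0, 6) Phoebe

# $ anchors the pattern to the end of the string (the period is escaped)
end = re.search(r"wonderful\.$", text2)
print(end.group())                    # wonderful.

# An anchored pattern that cannot match returns None
missing = re.search("^wonderful", text2)
print(missing)                        # None
```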

In [None]:
help(re.search)

Help on function search in module re:

search(pattern, string, flags=0)
    Scan through string looking for a match to the pattern, returning
    a Match object, or None if no match was found.



## Patterns and Character Classes

In [None]:
# Create a string of grades that a student received over four years of university
grades = "ABBCABABABAAACDA"

# How many A's were in the grade list?
re.findall("A", grades)

['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A']

In [None]:
# How many B's were in the grade list?
re.findall("B", grades)

['B', 'B', 'B', 'B', 'B']

In [None]:
# Suppose we want to count the number of A's and B's in the list.
# "AB" only matches an A immediately followed by a B, not A or B
re.findall("AB", grades)

['AB', 'AB', 'AB', 'AB']

In [None]:
re.findall("[AB]", grades) # This is called the set operator.

['A', 'B', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'A', 'A', 'A']

In [None]:
re.findall("[ABC]", grades) # This is called the set operator.

['A', 'B', 'B', 'C', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'A', 'A', 'C', 'A']

In [None]:
# If you wanna include all lower case letters we could use [a-z].
# If you wanna include all upper case letters we could use [A-Z]
# If you wanna include all numbers we could use [0-9]
re.findall("[A-Z]", grades)


['A',
 'B',
 'B',
 'C',
 'A',
 'B',
 'A',
 'B',
 'A',
 'B',
 'A',
 'A',
 'A',
 'C',
 'D',
 'A']

In [None]:
# What if we are really looking for patterns, BA and BC
re.findall("[B][AC]", grades)

['BC', 'BA', 'BA', 'BA']

In [None]:
re.findall("B[AC]", grades)

['BC', 'BA', 'BA', 'BA']

In [None]:
re.findall("BA|BC", grades) # pipe operator, which means OR

['BC', 'BA', 'BA', 'BA']

In [None]:
# Caret again!!
# We can use with the set operator (the bracket)
# Now we want to parse out all grades which are not A
re.findall("[^A]", grades)

['B', 'B', 'C', 'B', 'B', 'B', 'C', 'D']

In [None]:
# This regex matches a single character at the beginning of the string which is not an A.
# grades starts with an A, so the result is empty.
re.findall("^[^A]", grades) 

[]
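
To show the same pattern producing a match, the sketch below runs it on a second grade string. other_grades is a hypothetical list invented for this example, and the [B-D] range is included as a further use of the set operator.

```python
import re

grades = "ABBCABABABAAACDA"
other_grades = "BAACAB"   # hypothetical list that does not start with A

# Outside the brackets ^ anchors to the start; inside, it negates the set
first = re.findall("^[^A]", grades)
second = re.findall("^[^A]", other_grades)
print(first)    # []
print(second)   # ['B']

# A range inside the set: any grade from B through D
below_a = re.findall("[B-D]", grades)
print(len(below_a))   # 8
```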

## Quantifiers

Quantifiers specify how many times a pattern must repeat in order to match. The general form is e{m,n}, where e is the expression to repeat, m is the minimum number of repetitions, and n is the maximum.

If you provide only one value, as in e{m}, the regex matches exactly m repetitions.

Metacharacters:
* \* matches zero or more repetitions of the preceding regex
* \+ matches one or more repetitions of the preceding regex
* ? matches zero or one repetitions of the preceding regex
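
The three shorthand quantifiers can be compared side by side on the grades string. This is a small sketch; the open-ended form A{2,} is added as an assumption-free illustration of leaving out the upper bound, which Python's re module supports.

```python
import re

grades = "ABBCABABABAAACDA"

# B* : an A followed by zero or more B's
star = re.findall("AB*", grades)
print(star)       # ['ABB', 'AB', 'AB', 'AB', 'A', 'A', 'A', 'A']

# B+ : an A followed by at least one B
plus = re.findall("AB+", grades)
print(plus)       # ['ABB', 'AB', 'AB', 'AB']

# B? : an A followed by at most one B
optional = re.findall("AB?", grades)
print(optional)   # ['AB', 'AB', 'AB', 'AB', 'A', 'A', 'A', 'A']

# {2,} : two or more A's, with no upper bound
runs = re.findall("A{2,}", grades)
print(runs)       # ['AAA']
```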

In [None]:
# Let's use the grades as an example.
re.findall("A{2,20}", grades) # Find runs of 2 to 20 consecutive A's

['AAA']

In [None]:
re.findall("A{2}", grades)

['AA']

In [None]:
grades2 = "AAAAA"

In [None]:
re.findall("A{2}", grades2)

['AA', 'AA']
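
The result has two matches rather than four because matches do not overlap. A short sketch using re.finditer() to print where each match starts and ends makes this visible:

```python
import re

grades2 = "AAAAA"

# Matching resumes where the previous match ended, so matches never overlap
spans = [m.span() for m in re.finditer("A{2}", grades2)]
print(spans)   # [(0, 2), (2, 4)]; the fifth A is left over
```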

In [None]:
# We can also find a decreasing / increasing trend in a student's grades
re.findall("A{1,5}B{1,5}C{1,5}", grades)

['ABBC']

In [None]:
re.findall("C{1,5}B{1,5}A{1,5}", grades)

[]

In [None]:
test_string = "What stores are open during the (Labor Day)? (Lowe's, Walmart, Target, \
Kohl's and JCPenney) are, but (Costco) is closed Monday."

## Metacharacters

* \d - All digits
* \D - All non-digits
* \s - All whitespaces (space, tab, newline)
* \S - Not whitespace
* \w - Word character ([a-z], [A-Z], [0-9], _ )
* \W - Not a word character
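
A brief sketch of these classes in action. The sample string is made up for this example, and each class is combined with + where runs of characters are more readable than single characters.

```python
import re

sample = "Call 555 0199 at 9am"   # made-up string

digits = re.findall(r"\d+", sample)       # runs of digits
non_digits = re.findall(r"\D+", sample)   # runs of anything else
spaces = re.findall(r"\s", sample)        # individual whitespace characters
words = re.findall(r"\w+", sample)        # runs of word characters

print(digits)        # ['555', '0199', '9']
print(non_digits)    # ['Call ', ' ', ' at ', 'am']
print(len(spaces))   # 4
print(words)         # ['Call', '555', '0199', 'at', '9am']
```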

In [None]:
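# A first attempt at finding every word that ends with t.
# [t]$ only matches a t at the very end of the string, and the string ends with a period, so the result is empty.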
pattern = re.compile(r'[t]$')
pattern.findall(test_string)

[]

In [None]:
re.findall(r'(\w*)[t]', test_string) # This matches every word or partial word that ends with t.
# findall returns only the captured group, which is everything before the t

['Wha', 's', '', 'Walmar', 'Targe', 'bu', 'Cos']

In [None]:
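# How do we make sure the t is indeed the last character of the word?
# Require whitespace or a comma right after it. (Inside [], | is a literal character, so [\s,] is enough.)
re.findall(r'(\w+[t])[\s,]', test_string)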

['What', 'Walmart', 'Target', 'but']
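
One alternative to listing the characters that may follow the word is the word-boundary anchor \b, which is part of Python's re syntax but is not covered in the list above. The sketch below is offered as a variation, not as the notebook's own approach; unlike the version above, it would also catch a word ending in t that is followed by a period or a closing parenthesis.

```python
import re

test_string = ("What stores are open during the (Labor Day)? (Lowe's, Walmart, Target, "
               "Kohl's and JCPenney) are, but (Costco) is closed Monday.")

# \b matches the empty position between a word character and a non-word character
words_ending_in_t = re.findall(r"\b\w+t\b", test_string)
print(words_ending_in_t)   # ['What', 'Walmart', 'Target', 'but']
```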

In [None]:
# re.finditer returns an iterator of Match objects, one for each match of the regex.
for item in re.finditer(r'(\w+[t])[\s,]', test_string):
    # .group() with no argument returns the whole match, including the trailing space or comma
    print(item.group())

What 
Walmart,
Target,
but 


## Groups

In [None]:
# Groups are useful when you only want to return a certain part of the pattern you matched.
# To group patterns together we'll use parentheses ()

In [None]:
for item in re.finditer(r'(\w+[t])[\s,]', test_string):
    # Now I use the .groups() method. It returns a tuple containing every group in the pattern.
    print(item.groups())

('What',)
('Walmart',)
('Target',)
('but',)


In [None]:
for item in re.finditer(r'(\w+[t])[\s,]', test_string):
    # .group(1) returns only the first parenthesized group
    print(item.group(1))

What
Walmart
Target
but
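
When a pattern has more than one group, naming the groups can make the results easier to read. The sketch below uses the (?P<name>...) syntax from Python's re module; the group names stem and last, and the shortened sentence, are illustrative choices.

```python
import re

sentence = "Walmart, Target, but (Costco) is closed Monday."

# Two named groups: everything before the final t, and the t itself
pattern = re.compile(r"\b(?P<stem>\w+)(?P<last>t)\b")

for m in pattern.finditer(sentence):
    # group(0) is the whole match; named groups can be read by name or as a dict
    print(m.group(0), m.group("stem"), m.groupdict())

stems = [m.group("stem") for m in pattern.finditer(sentence)]
print(stems)   # ['Walmar', 'Targe', 'bu']
```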


## Raw string

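* r'' pattern: the r prefix creates a raw string, in which backslashes are kept as literal characters instead of starting escape sequences such as \n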

In [None]:
print(r'In Son Zeng')

In Son Zeng


In [None]:
print(r'In Son Zeng\n')

In Son Zeng\n


In [None]:
print('In Son Zeng\n')

In Son Zeng
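
Raw strings matter for regexes because many regex tokens begin with a backslash. The sketch below uses \b (word boundary in a regex, backspace in an ordinary Python string) to show the difference; the sentence is made up for this example.

```python
import re

sentence = "the cat sat on the mat"

# Without the r prefix, Python turns "\b" into a backspace character
# before the regex engine sees it, so nothing matches
plain = re.findall("\bthe\b", sentence)
print(plain)    # []

# With the r prefix, the regex engine receives \b and treats it as a word boundary
raw = re.findall(r"\bthe\b", sentence)
print(raw)      # ['the', 'the']

# The two literals really are different strings
print(len("\b"), len(r"\b"))   # 1 2
```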



## Back-referencing
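
The notebook leaves this section empty. As a minimal sketch of the idea, assuming the standard \1 syntax of Python's re module, a back-reference matches the same text that an earlier group captured. The sentence below is invented for illustration.

```python
import re

sentence = "This is is a test of of repeated words."   # made-up example

# (\w+) captures a word; \1 must then match that same text again
doubled = re.findall(r"\b(\w+)\s+\1\b", sentence)
print(doubled)   # ['is', 'of']

# Back-references also work in replacements: keep one copy of each doubled word
cleaned = re.sub(r"\b(\w+)\s+\1\b", r"\1", sentence)
print(cleaned)   # This is a test of repeated words.
```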


