# Introduction to Regular Expressions

In this module, we'll introduce regular expressions or "regex" as a way to perform more sophisticated string functions. Regex can search for patterns within a string rather than only searching for specific characters.

For example, we can write some regex code to extract an email address (e.g., " myname@email.com ") using the following pattern:

1. A break (e.g., " " or ","), followed by
2. Alpha-numeric characters, followed by
3. The "@" symbol, followed by
4. Alpha-numeric characters, followed by
5. A period ("."), followed by
6. Alpha-numeric characters, followed by
7. A break (e.g., " " or ",")

The regular expressions module in Python is called **re**. We import **re** as follows:

In [1]:
import re
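
With **re** imported, the email pattern outlined in the numbered steps above can be sketched roughly as follows. This is a minimal illustration, not a complete email validator: the sample sentence and the names `sample`, `email_pattern`, and `emails` are assumptions made for the example, and a word boundary (`\b`) stands in for the "break" in steps 1 and 7.

```python
import re

# Hypothetical sample text; the address is the one used in the example above.
sample = "Please write to myname@email.com, or call the office."

# Steps 2-6: alphanumerics, "@", alphanumerics, ".", alphanumerics.
# \b (a word boundary) plays the role of the "break" in steps 1 and 7.
email_pattern = r'\b[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-zA-Z0-9]+\b'

emails = re.findall(email_pattern, sample)
print(emails)  # ['myname@email.com']
```

Real-world addresses can contain characters such as dots, hyphens, or plus signs, which this simplified pattern would not capture.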

#### Regex Functions

Regular expression (i.e., "regex") syntax can be quite complex. The purpose of this tutorial is not to help you become an expert. Rather, you will come away with a basic understanding of what regex can do.

Here are the basic functions you can use within the **re** module:

| Function | Description |
| :- | :- |
| re.match(pattern, string) | returns a match object if the pattern matches at the beginning of the string; otherwise returns None |
| re.search(pattern, string) | returns a match object for the first occurrence of the pattern anywhere in the string; returns None if there is no match |
| re.split(pattern, string) | returns a list with the string split by the pattern |
| re.findall(pattern, string) | returns a list of all non-overlapping occurrences of the pattern within the string |
| re.sub(pattern, replacement_pattern, string) | returns a new string with every occurrence of the pattern replaced by the replacement pattern |

In [2]:
text = 'Here is an example sentence. This sentence is an example!'

In [3]:
test = re.search('sentence', text)
print(test)
print(test.start())
print(test.end())
print(test.group())

<re.Match object; span=(19, 27), match='sentence'>
19
27
sentence
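
For contrast with the `re.search` call above, here is a small sketch of how `re.match` behaves on the same text. It only succeeds when the pattern is found at the very start of the string. The variable names `no_match` and `start_match` are chosen just for this illustration.

```python
import re

text = 'Here is an example sentence. This sentence is an example!'

# 'sentence' is not at the start of the string, so re.match returns None.
no_match = re.match('sentence', text)
print(no_match)             # None

# 'Here' is at the start of the string, so re.match returns a match object.
start_match = re.match('Here', text)
print(start_match.group())  # Here
print(start_match.span())   # (0, 4)
```

Because a failed match returns None, it is usually worth checking the result before calling methods such as `.group()` on it.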


In [4]:
test = re.split('sentence',text)
test

['Here is an example ', '. This ', ' is an example!']

In [5]:
test = re.findall('sentence', text)
count = len(test)
print(test)
print(count)

['sentence', 'sentence']
2


In [6]:
test = re.sub('sentence', 'phrase', text)
test

'Here is an example phrase. This phrase is an example!'

#### Special Patterns

The power of regex comes not from these built-in functions, but from our ability to create flexible patterns that capture specific text.

Here's a list of special patterns in regex:

| Special Pattern | Description |
| :-: | :- |
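| \w+ | word (one or more letters, digits, or underscores) |
| \d | digit |
| \s | whitespace character (space, tab, or newline) |
| \b | word boundary |
| \S | NOT a whitespace character |
| \D | NOT a digit |
| .* | wildcard: the . matches any character except a newline, and the * repeats it zero or more times |
| + | one or more occurrences of the previous character or group |
| * | zero or more occurrences of the previous character or group |
| \| | OR operator within a group |
| ? | previous character or group is optional |
| {} | exactly the number of occurrences given inside the braces (e.g., {4}) |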
| [a-z] | lower case character class |
| [A-Z] | upper case character class |
| [a-zA-Z] | lower case or upper case character class |
| [0-9] | digit character class |
| [a-zA-Z0-9] | lower case or upper case or digit character class |

To use these special patterns, the best practice is to use Python's **raw string** by including an "r" at the beginning of the string pattern.
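
To see why the raw string matters, consider the following small sketch. The sample text and the variable names `sample`, `plain`, and `raw` are made up for this illustration. In a regular Python string, `'\b'` is the backspace character, so the regex engine never sees the word-boundary pattern.

```python
import re

sample = "take a break"  # hypothetical sample text

# In a regular string, Python converts '\b' into a backspace character
# before the regex engine sees it, so nothing in the sample matches.
plain = re.findall('\bbreak\b', sample)

# In a raw string, the backslash is preserved, so regex sees \b (word boundary).
raw = re.findall(r'\bbreak\b', sample)

print(plain)                  # []
print(raw)                    # ['break']
print(len('\b'), len(r'\b'))  # 1 2
```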

In [7]:
text = "You miss 100% of the shots you don't take - Wayne Gretzky - Michael Scott"

In [8]:
test = re.findall(r'\w+',text)
print(test)

['You', 'miss', '100', 'of', 'the', 'shots', 'you', 'don', 't', 'take', 'Wayne', 'Gretzky', 'Michael', 'Scott']


In [9]:
test = re.findall(r'[A-Z][a-z]+',text)
test

['You', 'Wayne', 'Gretzky', 'Michael', 'Scott']

In [10]:
test = re.findall(r'\b[a-z][a-z][a-z][a-z]\b',text)
test

['miss', 'take']

In [11]:
test = re.findall(r'\b[a-z]{4}\b',text)
test

['miss', 'take']

In [12]:
test = re.sub(r'\b[a-z]{4}\b','XXXX',text)
test

"You XXXX 100% of the shots you don't XXXX - Wayne Gretzky - Michael Scott"

In [13]:
# r'\d' alone would match each digit separately: ['1', '0', '0']
test = re.findall(r'\d+',text)
test

['100']

In [14]:
test = re.findall(r'[0-9]+',text)
test

['100']

In [15]:
test = re.findall(r'shots .* take',text)
test

["shots you don't take"]
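
A few patterns from the table (`|`, `?`, `\S`, and `\D`) are not used in the cells above. The sketch below shows one way they might be applied. The short strings passed in the second and fourth calls, and the names `either`, `optional_s`, `non_space`, and `non_digit`, are hypothetical and chosen only for this illustration.

```python
import re

text = "You miss 100% of the shots you don't take - Wayne Gretzky - Michael Scott"

# | matches either alternative.
either = re.findall(r'Wayne|Michael', text)
print(either)         # ['Wayne', 'Michael']

# ? makes the preceding character optional, so both spellings match.
optional_s = re.findall(r'shots?', "one shot, two shots")
print(optional_s)     # ['shot', 'shots']

# \S+ matches runs of non-whitespace, so punctuation stays attached.
non_space = re.findall(r'\S+', text)
print(non_space[:3])  # ['You', 'miss', '100%']

# \D+ matches runs of non-digit characters (including the space).
non_digit = re.findall(r'\D+', 'ENG 101')
print(non_digit)      # ['ENG ']
```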

#### Regex Groups

Regex groups are helpful when you want to *capture* matches within a string. We create groups using the () symbols. We can then refer to each captured group using a backslash followed by a number representing each group starting at 1, then 2, etc. (e.g., "\\3" represents the third captured group).

Capturing groups are especially useful when you want to rearrange or replace the captured text. An example will make this clearer. Say we have a list of people in the format:

    LastName, FirstName - Job Position

In [16]:
text = """Scott, Michael - Regional Manager
Schrute, Dwight - Salesperson
Halpert, Jim - Salesperson
Beesly, Pam - Receptionist
Howard, Ryan - Temp
Bernard, Andy - Salesperson
Hudson, Stanley - Salesperson
Malone, Kevin - Accountant
Palmer, Meredith - Supplier Relations
Martin, Angela - Accountant
Martinez, Oscar - Accountant
Lapin, Phyllis - Salesperson
Kapoor, Kelly - Customer Service
Flenderson, Toby - Human Resources Representative
Bratton, Creed - Quality Control
Philbin, Darryl - Foreman
"""
print(text)

Scott, Michael - Regional Manager
Schrute, Dwight - Salesperson
Halpert, Jim - Salesperson
Beesly, Pam - Receptionist
Howard, Ryan - Temp
Bernard, Andy - Salesperson
Hudson, Stanley - Salesperson
Malone, Kevin - Accountant
Palmer, Meredith - Supplier Relations
Martin, Angela - Accountant
Martinez, Oscar - Accountant
Lapin, Phyllis - Salesperson
Kapoor, Kelly - Customer Service
Flenderson, Toby - Human Resources Representative
Bratton, Creed - Quality Control
Philbin, Darryl - Foreman



We can write some regex code that will capture the LastName and FirstName groups of characters and replace them with whatever we want.

Let's just start out by finding all "LastName, FirstName" matches before we worry about adding the complexity of groups.

In [17]:
names = re.findall(r'[A-Z][a-z]+,\s[A-Z][a-z]+', text)
names

['Scott, Michael',
 'Schrute, Dwight',
 'Halpert, Jim',
 'Beesly, Pam',
 'Howard, Ryan',
 'Bernard, Andy',
 'Hudson, Stanley',
 'Malone, Kevin',
 'Palmer, Meredith',
 'Martin, Angela',
 'Martinez, Oscar',
 'Lapin, Phyllis',
 'Kapoor, Kelly',
 'Flenderson, Toby',
 'Bratton, Creed',
 'Philbin, Darryl']

Now let's adjust this code slightly to capture these groups by adding parentheses:

In [18]:
names = re.findall(r'([A-Z][a-z]+),\s([A-Z][a-z]+)', text)
names

[('Scott', 'Michael'),
 ('Schrute', 'Dwight'),
 ('Halpert', 'Jim'),
 ('Beesly', 'Pam'),
 ('Howard', 'Ryan'),
 ('Bernard', 'Andy'),
 ('Hudson', 'Stanley'),
 ('Malone', 'Kevin'),
 ('Palmer', 'Meredith'),
 ('Martin', 'Angela'),
 ('Martinez', 'Oscar'),
 ('Lapin', 'Phyllis'),
 ('Kapoor', 'Kelly'),
 ('Flenderson', 'Toby'),
 ('Bratton', 'Creed'),
 ('Philbin', 'Darryl')]
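
Captured groups can also be read from a match object. The sketch below applies the same pattern to a single line with `re.search`. The variable names `line` and `match` are chosen just for this illustration.

```python
import re

line = "Scott, Michael - Regional Manager"

match = re.search(r'([A-Z][a-z]+),\s([A-Z][a-z]+)', line)

print(match.group(0))  # Scott, Michael  (the entire match)
print(match.group(1))  # Scott           (first captured group)
print(match.group(2))  # Michael         (second captured group)
print(match.groups())  # ('Scott', 'Michael')
```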

We can now refer to these groups in our replacement pattern. For example, let's replace our text so that it is in the format:

    FirstName LastName - Job Position
    
In the code below, the "\1" refers to the first captured group (i.e., the last name). The "\2" refers to the second captured group (i.e., the first name).

In [19]:
new_text = re.sub(r'([A-Z][a-z]+),\s([A-Z][a-z]+)', r'\2 \1', text)
print(new_text)

Michael Scott - Regional Manager
Dwight Schrute - Salesperson
Jim Halpert - Salesperson
Pam Beesly - Receptionist
Ryan Howard - Temp
Andy Bernard - Salesperson
Stanley Hudson - Salesperson
Kevin Malone - Accountant
Meredith Palmer - Supplier Relations
Angela Martin - Accountant
Oscar Martinez - Accountant
Phyllis Lapin - Salesperson
Kelly Kapoor - Customer Service
Toby Flenderson - Human Resources Representative
Creed Bratton - Quality Control
Darryl Philbin - Foreman
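
The same idea extends to more than two groups. As a rough sketch, a third group could capture the job position so that it can be moved as well. The target format "FirstName LastName (Job Position)" and the shortened three-line list are assumptions made for this illustration.

```python
import re

# Shortened version of the list above, used only for this sketch.
text = """Scott, Michael - Regional Manager
Schrute, Dwight - Salesperson
Beesly, Pam - Receptionist
"""

# Group 1: last name, group 2: first name, group 3: rest of the line.
# By default "." does not match a newline, so (.+) stops at the end of each line.
new_text = re.sub(r'([A-Z][a-z]+),\s([A-Z][a-z]+)\s-\s(.+)', r'\2 \1 (\3)', text)
print(new_text)
# Michael Scott (Regional Manager)
# Dwight Schrute (Salesperson)
# Pam Beesly (Receptionist)
```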



#### Exercise

You are given the following text string:

In [20]:
text = 'The student was taking ENG 101, ACC 200, FIN 101, and MTH 114.'

Do the following:

1. Obtain a list of all words (including numbers as "words") in the string.
2. Create a variable called `new_text` equal to `text` but replacing all numbers in the string with the word 'NUMBER'.
3. Obtain a list of all courses listed in the string, where each course is three uppercase letters followed by a space and then three digits. Write your code without the "{}" symbols.
4. Repeat #3 but use the "{}" symbols in your pattern to repeat characters.
5. Use the .* symbols to obtain a list of all text that starts with 'student' and ends with '200' in the string.

#### Solution for # 1

In [21]:
test = re.findall(r'\w+', text)
print(test)

['The', 'student', 'was', 'taking', 'ENG', '101', 'ACC', '200', 'FIN', '101', 'and', 'MTH', '114']


#### Solution for # 2

In [22]:
new_text = re.sub(r'\d+', 'NUMBER', text)
new_text

'The student was taking ENG NUMBER, ACC NUMBER, FIN NUMBER, and MTH NUMBER.'

#### Solution for # 3

In [23]:
re.findall(r'[A-Z][A-Z][A-Z]\s[0-9][0-9][0-9]', text)

['ENG 101', 'ACC 200', 'FIN 101', 'MTH 114']

#### Solution for # 4

In [24]:
re.findall(r'[A-Z]{3}\s[0-9]{3}', text)

['ENG 101', 'ACC 200', 'FIN 101', 'MTH 114']

#### Solution for # 5

In [25]:
re.findall(r'student .* 200', text)

['student was taking ENG 101, ACC 200']

#### Exercise - Regex Groups

You are given the following text string with phone numbers in the format XXX-XXX-XXXX:

In [26]:
text = """801-123-4567
706-124-8765
714-321-9876"""

Write some regex code that will capture three groups:

1. The area code
2. The next three digits
3. The last four digits

Use re.sub to replace the string so that the phone numbers are in the format (XXX)XXX-XXXX:

    (801)123-4567
    (706)124-8765
    (714)321-9876

#### Solution - Regex Groups

In [27]:
new_text = re.sub(r'([0-9]{3})-([0-9]{3})-([0-9]{4})', r'(\1)\2-\3', text)
print(new_text)

(801)123-4567
(706)124-8765
(714)321-9876
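
As an alternative sketch, the same substitution can be written with the `\d` pattern from the Special Patterns table instead of `[0-9]`. For this input the result is the same. Note that for Python strings `\d` also matches digits from other scripts, while `[0-9]` matches only the ASCII digits.

```python
import re

text = """801-123-4567
706-124-8765
714-321-9876"""

# \d{3} and \d{4} replace [0-9]{3} and [0-9]{4}; the three groups are unchanged.
new_text = re.sub(r'(\d{3})-(\d{3})-(\d{4})', r'(\1)\2-\3', text)
print(new_text)
# (801)123-4567
# (706)124-8765
# (714)321-9876
```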
