# 2.4 Regular Expressions

Regular expressions, or "regex" for short, are a special syntax for searching for strings that match a specified pattern. They are a great tool for filtering and sorting through text when you want to match patterns rather than one or more hard-coded strings.

There are loads of options for the syntax so it's best to just jump in and get started with some examples.

In [1]:
import re

## Raw Strings

Python treats certain character sequences as having a special meaning; for example, \n is used to indicate a new line. However, sometimes these sequences appear in our own strings and we want to tell Python that a \n in our text is a literal backslash followed by the letter n, rather than a new line.

We can put the letter r before the opening quote of a string to tell Python that the text is what is known as a "raw string".

In [None]:
# print text without using raw string indicator
my_folder = "C:\desktop\notes"
print(my_folder)

C:\desktop
otes


See how Python interprets the \n to mean a new line! Now let's try it as a raw string...

In [None]:
# include raw string indicator
my_folder = r"C:\desktop\notes"
print(my_folder)

C:\desktop\notes


The folder path is now printed as we wanted. This is important to keep in mind when working with regular expressions as we'll want to make sure we are using raw strings when working with special characters. It's also just a good habit to get into when working with strings and regular expressions so you don't get caught out!
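
A quick way to see the difference is to compare the lengths of the two kinds of string. This is only a minimal sketch, and the variable names `normal` and `raw` are made up for illustration.

```python
# "\n" is one character (a newline); r"\n" is two (a backslash and the letter n)
normal = "\n"
raw = r"\n"

print(len(normal))   # 1
print(len(raw))      # 2
print(raw == "\\n")  # True: a raw string behaves like doubling the backslash
```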

## re.search

re.search is a function that lets us check whether a certain pattern appears in a string. It uses the logic re.search("pattern to find", "string to find it in"). It returns a Match object describing the first match if the pattern is found, or None if the pattern is not found.

In [6]:
# re.search() looks for the first location where the regex pattern matches the string
result_search = re.search("pattern", r"string containing the pattern")
print(result_search)

<re.Match object; span=(22, 29), match='pattern'>


In [7]:
print(result_search.group()) # returns just the matching pattern

pattern


In [8]:
result_search = re.search("pattern", r"the phrase to find isn't in this string")
print(result_search) # returns None

None
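
Because a Match object counts as true and None counts as false, the result of re.search can be tested directly in an if statement. The sketch below, which uses an illustrative variable called `result`, also shows a few other Match methods that report where the match was found.

```python
import re

result = re.search(r"pattern", "string containing the pattern")

# a Match object is truthy and None is falsy, so the result can be tested directly
if result:
    print(result.group())  # the matched text: pattern
    print(result.start())  # index where the match begins: 22
    print(result.end())    # index just past the end of the match: 29
    print(result.span())   # (start, end) as a tuple: (22, 29)
else:
    print("no match")
```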


## re.sub

re.sub allows us to find certain text and replace it. It uses the logic re.sub("pattern to find", "replacement text", "string").

In [10]:
string = r"sara was able to help me find the items i needed quickly"

In [None]:
# this replaces *all* matches of the pattern in the string
new_string = re.sub(r"sara", r"sarah", string) # replace the incorrect spelling of sarah
print(new_string)

sarah was able to help me find the items i needed quickly
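
re.sub replaces every match by default. As a small sketch on a made-up string, the optional count argument limits the number of replacements, and the related function re.subn also reports how many replacements were made.

```python
import re

text = "la la la"

# count limits how many matches are replaced (the default, 0, replaces all of them)
first_only = re.sub(r"la", "da", text, count=1)
print(first_only)  # da la la

# re.subn works like re.sub but returns (new_string, number_of_replacements)
new_text, n_replaced = re.subn(r"la", "da", text)
print(new_text, n_replaced)  # da da da 3
```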


## Regex Syntax

The real power of regex is being able to leverage the syntax to create more complex searches/replacements.

In [12]:
customer_reviews = ['sam was a great help to me in the store', 
                    'the cashier was very rude to me, I think her name was eleanor', 
                    'amazing work from sadeen!', 
                    'sarah was able to help me find the items i needed quickly', 
                    'lucy is such a great addition to the team', 
                    'great service from sara she found me what i wanted'
                   ]

**Find only sarah's reviews but account for the spelling of sara**

In [13]:
sarahs_reviews = []
pattern_to_find = r"sarah?" 
# ? makes the preceding character optional; here it follows the h,
# so our search will match both sarah and sara

In [14]:
for string in customer_reviews:
    if (re.search(pattern_to_find, string)):
        sarahs_reviews.append(string)
        
# re.search(pattern_to_find, string, re.IGNORECASE): -> re.IGNORECASE makes it case-insensitive

In [15]:
print(sarahs_reviews)

['sarah was able to help me find the items i needed quickly', 'great service from sara she found me what i wanted']
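
The search loop above can be wrapped in a small helper so it is easy to reuse with different patterns and flags. This is just one possible sketch: `filter_by_pattern` and the sample `reviews` list are hypothetical names and data, not part of the re module.

```python
import re

def filter_by_pattern(pattern, strings, flags=0):
    """Return the strings in which the pattern is found (hypothetical helper)."""
    return [s for s in strings if re.search(pattern, s, flags)]

reviews = ["Sarah was lovely", "sara helped me", "sam was great"]

# the search is case-sensitive by default, so "Sarah" is missed
print(filter_by_pattern(r"sarah?", reviews))
# ['sara helped me']

# re.IGNORECASE makes the search case-insensitive
print(filter_by_pattern(r"sarah?", reviews, re.IGNORECASE))
# ['Sarah was lovely', 'sara helped me']
```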


**Find reviews that start with the letter a**

In [19]:
a_reviews = []
pattern_to_find = r"^a" # the ^ anchor indicates the start of a string

In [20]:
for string in customer_reviews:
    if (re.search(pattern_to_find, string)):
        a_reviews.append(string)

In [22]:
print(a_reviews)

['amazing work from sadeen!']


**Find reviews that end with the letter y**

In [16]:
y_reviews = []
pattern_to_find = r"y$" # the $ anchor indicates the end of a string

In [17]:
for string in customer_reviews:
    if (re.search(pattern_to_find, string)):
        y_reviews.append(string)

In [18]:
print(y_reviews)

['sarah was able to help me find the items i needed quickly']
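
One thing to keep in mind is that ^ and $ refer to the start and end of the whole string by default, even when the string contains several lines. The sketch below uses a made-up two-line string to show how the re.MULTILINE flag makes the anchors apply to each line instead.

```python
import re

text = "bad parking\namazing staff"

# by default ^ only matches at the very start of the whole string
print(re.search(r"^a", text))                        # None

# with re.MULTILINE, ^ also matches at the start of each line
print(re.search(r"^a", text, re.MULTILINE).start())  # 12

# likewise, $ then matches at the end of each line
print(re.search(r"g$", text))                        # None
print(re.search(r"g$", text, re.MULTILINE).start())  # 10
```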


**Find reviews that contain the words needed or wanted**

In [19]:
needwant_reviews = []
pattern_to_find = r"(need|want)ed" # the pipe operator | can be used to mean OR

In [20]:
for string in customer_reviews:
    if (re.search(pattern_to_find, string)):
        needwant_reviews.append(string)

In [21]:
print(needwant_reviews)

['sarah was able to help me find the items i needed quickly', 'great service from sara she found me what i wanted']
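
The brackets in (need|want) do two jobs: they limit what the | applies to, and they also "capture" whichever option matched. As a short sketch (the variable name `match` is illustrative), the captured text can be read back with group(1).

```python
import re

match = re.search(r"(need|want)ed", "great service from sara she found me what i wanted")

print(match.group())   # the whole match: wanted
print(match.group(0))  # same as group(): wanted
print(match.group(1))  # just the text inside the first pair of brackets: want
```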


**Remove anything from the review that isn't a word character or whitespace (i.e. remove punctuation)**

In [26]:
no_punct_reviews = []

pattern_to_find = r"[^\w\s]" 
# [^\w\s] = match anything that is NOT:
#    \w = word characters (letters, numbers, underscore)
#    \s = whitespace (space, tab, newline)
# So this finds all punctuation and special characters

In [27]:
for string in customer_reviews:
    no_punct_string = re.sub(pattern_to_find, "", string)
    no_punct_reviews.append(no_punct_string)

In [28]:
print(no_punct_reviews)

['sam was a great help to me in the store', 'the cashier was very rude to me I think her name was eleanor', 'amazing work from sadeen', 'sarah was able to help me find the items i needed quickly', 'lucy is such a great addition to the team', 'great service from sara she found me what i wanted']
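
Note that \w keeps digits and underscores as well as letters, and that removing punctuation can leave runs of spaces behind. The sketch below uses a made-up string to show both effects, with a second re.sub as one possible way to tidy the whitespace.

```python
import re

messy = "order #42:   user_name  said 'wow'!!"

# [^\w\s] removes punctuation but keeps letters, digits and underscores
no_punct = re.sub(r"[^\w\s]", "", messy)
print(no_punct)  # order 42   user_name  said wow

# \s+ matches a run of one or more whitespace characters;
# replace each run with a single space
tidy = re.sub(r"\s+", " ", no_punct).strip()
print(tidy)      # order 42 user_name said wow
```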


## Other Examples

In [None]:
# Find a number in a sentence

text = "There are 15 apples on the table."
pattern = r"\d+"  # \d+ means one or more digits

match = re.search(pattern, text)
if match:
    print("Found number:", match.group())
else:
    print("No number found.")

Found number: 15
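
re.search stops at the first match. If a sentence contains several numbers, re.findall returns all of them as a list of strings. A minimal sketch with a made-up sentence:

```python
import re

text = "There are 15 apples, 3 pears and 120 grapes on the table."

numbers = re.findall(r"\d+", text)
print(numbers)                    # ['15', '3', '120']

# the matches are strings, so convert them if you need real numbers
print([int(n) for n in numbers])  # [15, 3, 120]
```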


In [None]:
# Extract emails with or without dots in names

emails = ["jane.doe@gmail.com", "janedoe@gmail.com", "john@example.com"]
pattern = r"jane\.?doe@gmail\.com"  # \. matches a literal dot, and the ? after it makes that dot optional

for email in emails:
    if re.search(pattern, email):
        print("Found:", email)

Found: jane.doe@gmail.com
Found: janedoe@gmail.com
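
The backslashes before the dots matter because an unescaped . matches any character. The sketch below shows the difference on a made-up address, along with re.escape, which adds the backslashes for us.

```python
import re

# an unescaped . matches any character (except a newline), so this pattern is too loose
print(bool(re.search(r"gmail.com", "jane@gmailxcom")))   # True

# escaping the dot makes it match a literal full stop only
print(bool(re.search(r"gmail\.com", "jane@gmailxcom")))  # False

# re.escape escapes the special characters in a string
print(re.escape("gmail.com"))  # gmail\.com
```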


In [None]:
# Filter usernames that end with a digit

usernames = ["alice", "bob99", "carol", "daniel1", "eve_"]

pattern = r"\d$"  # \d means a digit, $ means it must be the last character

numeric_ending = [u for u in usernames if re.search(pattern, u)]
print(numeric_ending)  # Output: ['bob99', 'daniel1']

In [None]:
# Check if someone said yes or no

responses = ["Yes, I would love that!", "No way!", "Maybe", "Sure", "no thanks"]

pattern = r"(yes|no)"

matched = [r for r in responses if re.search(pattern, r, re.IGNORECASE)]
print(matched)  # Output: ['Yes, I would love that!', 'No way!', 'no thanks']
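
A pattern like (yes|no) also matches those letters inside longer words such as "know" or "nothing". One way to avoid this, sketched here with a made-up list of responses, is to add \b (a word boundary) on each side of the group.

```python
import re

responses = ["I know what I want", "no thanks", "Yes please", "nothing for me"]

# without boundaries, "no" is found inside "know" and "nothing"
loose = [r for r in responses if re.search(r"(yes|no)", r, re.IGNORECASE)]
print(loose)
# ['I know what I want', 'no thanks', 'Yes please', 'nothing for me']

# \b marks a word boundary, so yes/no must appear as whole words
strict = [r for r in responses if re.search(r"\b(yes|no)\b", r, re.IGNORECASE)]
print(strict)
# ['no thanks', 'Yes please']
```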

In [None]:
# Clean tweets

tweets = ["Love this! ❤️", "Ugh... Mondays 😒", "Working hard, or hardly working?"]

# removing an emoji leaves the space before it behind, so strip() tidies up the ends
cleaned = [re.sub(r"[^\w\s]", "", t).strip() for t in tweets]
print(cleaned)  # Output: ['Love this', 'Ugh Mondays', 'Working hard or hardly working']

## What I Learned

- Regex helps **search, filter, replace, and clean text efficiently**.

- Always use r"..." (raw strings) when writing regex patterns to avoid issues with escape characters (\n, \t, etc.).

- Use anchors (^, $) to control match positions.


Useful functions

- re.search(pattern, string) → Finds first match anywhere in the string.

- re.sub(pattern, replacement, string) → Replaces all matches with new text.

- re.findall(pattern, string) → Returns all non-overlapping matches as a list.

- re.match(pattern, string) → Matches only at the beginning of the string.
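
The difference between re.search and re.match is easiest to see side by side. A minimal sketch on a made-up string:

```python
import re

text = "great service from sara"

# re.search looks anywhere in the string
print(re.search(r"sara", text) is not None)  # True

# re.match only looks at the beginning of the string
print(re.match(r"sara", text))               # None
print(re.match(r"great", text).group())      # great
```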

