**SA433 &#x25aa; Data Wrangling and Visualization &#x25aa; Fall 2025**

# Lesson 21. Regular Expressions

## Overview

- A **regular expression** (**regex** for short) is a pattern used to match character combinations in strings


- They are extremely useful, but they can be a bit difficult to use


- First, we'll spend some time learning the basics of regular expressions 


- Then, we'll see how regular expressions can be used in data wrangling

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## An introduction to regular expressions

### Regular expressions in Python

- Regular expressions are part of the Python Standard Library, in the `re` module
    - [Documentation for the `re` module](https://docs.python.org/3/library/re.html)

- Let's begin by importing `re`:

In [1]:
import re

- To perform a match with regular expressions, we can use the `re.search()` function

- `re.search(PATTERN, STRING)` scans through `STRING` looking for the *first* location where the regular expression `PATTERN` produces a match

    - To make your lives easier, make sure to use a **raw string** for `PATTERN`:

        ```python
        r'This is a raw string'
        ```
        <br>
        
        - Regular expressions often require a lot of special characters, like backslashes (`\`)
        - Using raw strings prevents the special characters from taking on their usual behaviors<br><br>

    - `re.search()` returns a **match object** if it finds a match, or `None` if no position in the string matches the pattern 

- For example, to find where the substring `an` matches in the string `banana`, we can write: 

In [2]:
print(re.search(r'an', 'banana'))

<re.Match object; span=(1, 3), match='an'>


- Similarly, we can find where `an` matches in `apple` like this:

In [3]:
print(re.search(r'an', 'apple'))

None
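
A match object carries more than its printed summary. The sketch below shows a few standard match object methods (`.group()`, `.span()`, `.start()`, `.end()`) and the usual check for `None` before using them. The variable names are only illustrative.

```python
import re

match = re.search(r'an', 'banana')

# re.search() returns None when nothing matches, so check before using the result
if match is not None:
    matched_text = match.group()   # the text that matched: 'an'
    start, end = match.span()      # positions in the string: (1, 3)
    print(matched_text, start, end)
    print(match.start(), match.end())

no_match = re.search(r'an', 'apple')
print(no_match is None)            # True
```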


- To help us learn about regular expressions, we'll use the function `view_regex_search()` defined below, which lets us see visually how a regular expression matches with a string or a list of strings:

In [4]:
from IPython.display import display, HTML

def view_regex_search(regex, strings):
    """
    View the results of `re.search()` visually in a Jupyter notebook.
    
    - regex: string containing regular expression
    - strings: string or list of strings
    """
    if isinstance(strings, str):
        strings = [strings]
        
    for string in strings:
        match = re.search(regex, string)
        if match:
            l, r = match.span()

            string_left = string[:l]
            string_match = string[l:r]
            string_right = string[r:]

            string_html = (
                '<pre>'
                + string_left 
                + '<span style="background-color:lightgrey;'
                + 'border: 1px solid gray;border-radius:2px">'
                + string_match
                + '</span>'
                + string_right
                + '</pre>'
            )

        else:
            string_html = '<pre>' + string + '</pre>'
            
        display(HTML(string_html)) 

### Basic matches

- The simplest patterns match exact strings:

In [None]:
view_regex_search(..., ['apple', 'banana', 'pear'])

In [5]:
# Solution
view_regex_search(r'an', ['apple', 'banana', 'pear'])

- The next step up in complexity is `.`, which matches any character except a newline:

In [None]:
view_regex_search(..., ['apple', 'banana', 'pear'])

In [6]:
# Solution
view_regex_search(r'.a.', ['apple', 'banana', 'pear'])

### Escaping (aka backslash despair)

- Wait a minute... if `.` matches any character, how do we match the character `.` specifically? 🤔

- We can **escape** a special character like `.` by putting a backslash `\` in front of it to tell the regular expression that we want to match it exactly, and not use its special behavior
    - We'll see more special characters in a moment

- For example:

In [None]:
view_regex_search(..., ['abc', 'a.c', 'bef'])

In [7]:
# Solution
view_regex_search(r'\.', ['abc', 'a.c', 'bef'])

- Hold on... if we use backslashes to escape special characters, how do we match the character `\` specifically?


- We use *2* backslashes, like this 🤦‍♀️:

In [None]:
view_regex_search(..., ["a\\b", "abc"])

In [8]:
# Solution
view_regex_search(r'\\', ["a\\b", "abc"])
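
To see why raw strings help here, the sketch below compares the same pattern written as a raw string and as a regular string. In a regular string, Python itself also treats `\` as an escape character, so each backslash has to be typed twice. The variable names are only illustrative.

```python
import re

text = 'a\\b'            # regular string: stores the 3 characters a, \, b
print(len(text))         # 3

raw_pattern = r'\\'      # raw string: stores 2 backslashes
plain_pattern = '\\\\'   # regular string: 4 typed, 2 stored

# Both spellings produce the same pattern, which matches one literal backslash
print(raw_pattern == plain_pattern)
print(re.search(raw_pattern, text).span())
```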

### Anchors

- It's often useful to **anchor** the regular expression so that it matches from the start or end of the string

- We can use:
    - `^` to match the start of the string
    - `$` to match the end of the string

In [None]:
view_regex_search(..., ['apple', 'banana', 'pear'])

In [9]:
# Solution
view_regex_search(r'^a', ['apple', 'banana', 'pear'])

In [None]:
view_regex_search(..., ['apple', 'banana', 'pear'])

In [10]:
# Solution
view_regex_search(r'a$', ['apple', 'banana', 'pear'])

❓ **Exercise 1.** Write a regular expression that matches only the second string in this list:

```python
['apple pie', 'apple', 'apple cake']
```

In [None]:
view_regex_search(..., ['apple pie', 'apple', 'apple cake'])

In [11]:
# Solution
view_regex_search(r'^apple$', ['apple pie', 'apple', 'apple cake'])

❓ **Exercise 2.** Write a regular expression that matches only the first string in this list:

```python
['$^$', 'ab$^$sfas']
```

In [None]:
view_regex_search(..., ['$^$', 'ab$^$sfas'])

In [12]:
# Solution
view_regex_search(r'^\$\^\$', ['$^$', 'ab$^$sfas'])
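
When a literal string contains many special characters, escaping each one by hand gets tedious. As a possible shortcut (not used elsewhere in this lesson), the standard library function `re.escape()` adds the backslashes for us. A minimal sketch:

```python
import re

literal = '$^$'

# re.escape() puts a backslash in front of each special character
escaped = re.escape(literal)
print(escaped)

# Anchor the escaped literal to the start of the string
pattern = '^' + escaped

for string in ['$^$', 'ab$^$sfas']:
    print(string, re.search(pattern, string) is not None)
```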

### Character classes

- There are a variety of special sequences, called **character classes**, that match any one character from a particular set of characters

- For example:
    - `\d` matches any digit
    - `\s` matches any whitespace (e.g., space, tab, newline)
    - `[abc]` matches `a`, `b`, or `c`
    - `[^abc]` matches anything except `a`, `b`, or `c`

In [None]:
view_regex_search(..., ['apple', '2 apples', '3   apples'])

In [13]:
# Solution
view_regex_search(r'\d', ['apple', '2 apples', '3   apples'])

In [None]:
view_regex_search(..., ['apple', '2 apples', '3   apples'])

In [14]:
# Solution
view_regex_search(r'\s', ['apple', '2 apples', '3   apples'])

In [None]:
view_regex_search(..., ['apple', '2 apples', '3   apples'])

In [15]:
# Solution
view_regex_search(r'[^23]', ['apple', '2 apples', '3   apples'])
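
The bracket forms `[abc]` and `[^abc]` work the same way as `\d` and `\s`: each matches exactly one character. The sketch below uses a small helper, `first_match()` (a hypothetical name introduced only for this example), to print the first matching character in each string.

```python
import re

def first_match(pattern, string):
    """Return the text of the first match, or None if there is no match."""
    match = re.search(pattern, string)
    return match.group() if match else None

strings = ['apple', '2 apples', '3   apples']

for string in strings:
    in_set = first_match(r'[abc]', string)       # first a, b, or c
    not_in_set = first_match(r'[^abc]', string)  # first character that is not a, b, or c
    print(repr(string), in_set, not_in_set)
```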

### Alternation

- You can also match one of several alternative patterns with `|`


- To clarify how things are grouped, we can use parentheses, just like with mathematical expressions


- For example, to match "gray" or "grey", we can write:

In [None]:
view_regex_search(..., ['grey', 'gray'])

In [16]:
# Solution
view_regex_search(r'gr(e|a)y', ['grey', 'gray'])
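
The parentheses matter because `|` separates everything to its left from everything to its right. The sketch below compares the grouped pattern with an ungrouped version, which means "`gre` or `ay`" rather than "`grey` or `gray`".

```python
import re

grouped = r'gr(e|a)y'    # gr, then e or a, then y
ungrouped = r'gre|ay'    # either gre, or ay

for word in ['grey', 'gray']:
    print(
        word,
        re.search(grouped, word).group(),
        re.search(ungrouped, word).group(),
    )
```

For single-character alternatives like this one, the character class `gr[ea]y` is an equivalent way to write the pattern.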

❗️ For Exercises 3-5, use the list of random words below.

In [17]:
random_word_list = ['bring', 'fierce', 'graded', 'true', 'obvious',
                    'unlock', 'eggs', 'tremble', 'screed', 'advertise']

❓ **Exercise 3.** Write a regular expression that matches all words that start with a vowel.

In [18]:
# Solution
view_regex_search(r'^[aeiou]', random_word_list)

❓ **Exercise 4.** Write a regular expression that matches all words that end with `ed`, but not with `eed`.

In [19]:
# Solution
view_regex_search(r'[^e]ed$', random_word_list)

❓ **Exercise 5.** Write a regular expression that matches all words that end with `ing` or `ise`.

In [20]:
# Solution
view_regex_search(r'i(ng|se)$', random_word_list)

### Repetition

- So far, the regular expressions we've written only match a pattern once

- We can control *how many times* a pattern matches with the following special characters:
    - `?` matches the preceding item 0 or 1 times
    - `+` matches the preceding item 1 or more times
    - `*` matches the preceding item 0 or more times

- For example:

In [None]:
view_regex_search(
    ..., 
    "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
)

In [21]:
# Solution
view_regex_search(
    r'CC?', 
    "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
)

In [None]:
view_regex_search(
    ..., 
    "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
)

In [22]:
# Solution
view_regex_search(
    r'CC+', 
    "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
)

In [None]:
view_regex_search(
    ..., 
    "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
)

In [23]:
# Solution
view_regex_search(
    r'C[LX]+', 
    "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
)

- We can also specify the number of matches precisely:
    - `{n}`: matches exactly n times
    - `{n,}`: matches n or more times
    - `{,m}`: matches at most m times
    - `{n,m}`: matches between n and m times (inclusive)

In [None]:
view_regex_search(
    ..., 
    "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
)

In [24]:
# Solution
view_regex_search(
    r'C{2}', 
    "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
)

In [None]:
view_regex_search(
    ..., 
    "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
)

In [25]:
# Solution
view_regex_search(
    r'C{2,}', 
    "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
)

- Note from the above examples that these matches are **greedy** by default: they will match the longest string possible


- We can make these matches **lazy** by putting a `?` after them, like this:

In [None]:
view_regex_search(
    ..., 
    "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
)

In [26]:
# Solution
view_regex_search(
    r'C{2,}?', 
    "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
)

In [None]:
view_regex_search(
    ..., 
    "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
)

In [27]:
# Solution
view_regex_search(
    r'C[LX]+?', 
    "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
)
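
To compare the repetition patterns side by side without the visual helper, the sketch below runs each one against the Roman numeral alone and prints the matched text. Each greedy pattern grabs as many characters as it can, and its lazy counterpart stops as early as it can.

```python
import re

numeral = 'MDCCCLXXXVIII'

patterns = [r'CC?', r'CC+', r'C{2,}', r'C{2,}?', r'C[LX]+', r'C[LX]+?']

# Map each pattern to the text of its first match
results = {p: re.search(p, numeral).group() for p in patterns}

for pattern, matched in results.items():
    print(f'{pattern:10} {matched}')
```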

❗️ For Exercises 6 and 7, use `random_word_list` defined above to check your work.

❓ **Exercise 6.** Write a regular expression that matches all words that start with two consonants.

In [28]:
# Solution
view_regex_search(r'^[^aeiou]{2}', random_word_list)

❓ **Exercise 7.** Write a regular expression that matches all words that have three or more vowels in a row.

In [29]:
# Solution
view_regex_search(r'[aeiou]{3,}', random_word_list)

### Grouping and backreferences

- We saw above that we can use parentheses `( )` to clarify how things are grouped

- By default, parentheses also define a numbered *capturing group* (numbered 1, 2, etc.)
    - A **capturing group** stores the part of the string matched by the part of the regular expression inside the parentheses

- You can refer to text previously matched by a capturing group with **backreferences** of the form `\1`, `\2`, etc.


- For example, we can match words with *repeated pairs* of letters as follows:

In [None]:
view_regex_search(..., ['apple', 'banana', 'pear', 'coconut'])

In [30]:
# Solution
view_regex_search(r'(..)\1', ['apple', 'banana', 'pear', 'coconut'])
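
The text stored by a capturing group can also be read back from the match object with `.group(1)`, `.group(2)`, and so on. A minimal sketch, using the same pattern and words:

```python
import re

pattern = r'(..)\1'   # any two characters, immediately repeated

for word in ['apple', 'banana', 'pear', 'coconut']:
    match = re.search(pattern, word)
    if match:
        # group() is the whole match; group(1) is what the parentheses captured
        print(word, match.group(), match.group(1))
    else:
        print(word, 'no match')
```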

### Lookahead and lookbehind

- Parentheses don't always define a numbered capturing group... 


- Sometimes, we want to match something only if it is followed by something else
    - For example, maybe we want to match `data` only if it is followed by `wrangling` or `visualization`

- We can use **lookahead** and **lookbehind** to perform these kinds of matches:
    - **Positive lookahead.** `x(?=y)` matches `x` only if `x` is followed by `y`
    - **Negative lookahead.** `x(?!y)` matches `x` only if `x` is *not* followed by `y`
    - **Positive lookbehind.** `(?<=y)x` matches `x` only if `x` is preceded by `y`
    - **Negative lookbehind.** `(?<!y)x` matches `x` only if `x` is *not* preceded by `y`

- For example, to match `data` only if it's followed by `wrangling` or `visualization`:

In [None]:
view_regex_search(
    ...,
    'This data requires a lot of data wrangling.'
)

In [31]:
# Solution
view_regex_search(
    r'data(?=\s+(wrangling|visualization))',
    'This data requires a lot of data wrangling.'
)
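
The four kinds of lookaround can be compared on two short sentences (made up for this sketch). Note that the lookaround itself is not part of the match: in every case the matched text is just `data`, and only its position changes.

```python
import re

ahead_text = 'data science and data wrangling'
behind_text = 'raw data and big data'

# Lookahead: check what comes after 'data'
pos_ahead = re.search(r'data(?= wrangling)', ahead_text)
neg_ahead = re.search(r'data(?! wrangling)', ahead_text)

# Lookbehind: check what comes before 'data'
pos_behind = re.search(r'(?<=big )data', behind_text)
neg_behind = re.search(r'(?<!big )data', behind_text)

for name, m in [('positive lookahead', pos_ahead),
                ('negative lookahead', neg_ahead),
                ('positive lookbehind', pos_behind),
                ('negative lookbehind', neg_behind)]:
    print(f'{name:20} {m.group()!r} at {m.span()}')
```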

❓ **Exercise 8.** Write a regular expression that matches the word `wrangling` only if it is *not* preceded by the word `data`. Use the following sentence to test your work:

```
Data wrangling is not the only kind of wrangling.
```

In [32]:
# Solution
view_regex_search(
    r'(?<![Dd]ata\s)wrangling',
    'Data wrangling is not the only kind of wrangling.'
)
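
Where the whitespace sits relative to the lookbehind matters. If an optional `\s?` is placed *outside* the lookbehind, the match can simply start at the `w`, where the four preceding characters are `ata ` rather than `Data`, so the lookbehind is satisfied. Putting the whitespace *inside* the lookbehind closes that gap. The sketch below compares the two. Python's `re` requires lookbehind patterns to have a fixed width, which `[Dd]ata\s` does.

```python
import re

sentence = 'Data wrangling is not the only kind of wrangling.'

outside = r'(?<![Dd]ata)\s?wrangling'   # whitespace outside the lookbehind
inside = r'(?<![Dd]ata\s)wrangling'     # whitespace inside the lookbehind

outside_match = re.search(outside, sentence)
inside_match = re.search(inside, sentence)

print(outside_match.span())   # lands on the first 'wrangling'
print(inside_match.span())    # lands on the second 'wrangling'
```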

### We've only scratched the surface...

- There are *tons* of tutorials, references, cheat sheets, etc. on regular expressions out there


- [Here's an introduction to regular expressions from the Python documentation](https://docs.python.org/3/howto/regex.html)


- [Here is a *very* useful website that helps you build and decipher regular expressions](https://regexr.com/)

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Using regular expressions with Pandas

- Regular expressions are particularly useful when we want to extract data from a descriptive source


- Let's look at a small example


- Let's start by importing Pandas:

In [33]:
import pandas as pd

# Increase column width in this notebook
pd.set_option('display.max_colwidth', None)

- Now, let's read in this small dataset on roses, with two columns: `NAME` and `BLOOM`: 

In [34]:
df = pd.read_csv('data/roses.csv')
df

Unnamed: 0,NAME,BLOOM
0,Evert van Dijk,"Carmine-pink, salmon-pink streaks, stripes, flecks. Warm pink, clear carmine pink, rose pink shaded salmon. Mild fragrance. Large, very double, in small clusters, high-centered bloom form. Blooms in flushes throughout the season."
1,Every Good Gift,"Red. Flowers velvety red. Moderate fragrance. Average diameter 4"". Medium-large, full (26-40 petals), borne mostly solitary bloom form. Blooms in flushes throughout the season."
2,Evghenya,"Orange-pink. 75 petals. Large, very double bloom form. Blooms in flushes throughout the season."
3,Evita,"White or white blend. None to mild fragrance. 35 petals. Large, full (26-40 petals), high-centered bloom form. Blooms in flushes throughout the season."
4,Evrathin,"Light pink. [Deep pink.] Outer petals white. Expand rarely. Mild fragrance. 35 to 40 petals. Average diameter 2.5"". Medium, double (17-25 petals), full (26-40 petals), cluster-flowered, in small clusters bloom form. Prolific, once-blooming spring or summer. Glandular sepals, leafy sepals, long sepals buds."
5,Evita 2,"White, blush shading. Mild, wild rose fragrance. 20 to 25 petals. Average diameter 1.25"". Small, very double, cluster-flowered bloom form. Blooms in flushes throughout the season."


### Filtering rows 

- We can use `.query()` with `.str.contains()` to filter rows whose column value matches a regular expression
    - [Documentation for `.str.contains()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.contains.html)

- For example, we can find all rows whose `BLOOM` value contains the word "white", either capitalized or not, as follows: 

In [35]:
# Solution
df.query('BLOOM.str.contains(r"[Ww]hite")')

Unnamed: 0,NAME,BLOOM
3,Evita,"White or white blend. None to mild fragrance. 35 petals. Large, full (26-40 petals), high-centered bloom form. Blooms in flushes throughout the season."
4,Evrathin,"Light pink. [Deep pink.] Outer petals white. Expand rarely. Mild fragrance. 35 to 40 petals. Average diameter 2.5"". Medium, double (17-25 petals), full (26-40 petals), cluster-flowered, in small clusters bloom form. Prolific, once-blooming spring or summer. Glandular sepals, leafy sepals, long sepals buds."
5,Evita 2,"White, blush shading. Mild, wild rose fragrance. 20 to 25 petals. Average diameter 1.25"". Small, very double, cluster-flowered bloom form. Blooms in flushes throughout the season."
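
Outside of `.query()`, `.str.contains()` returns a Boolean Series that can be used directly as a filter. The sketch below uses a small made-up Series (not the roses data) and also shows the `case=False` argument as a possible alternative to writing `[Ww]`.

```python
import pandas as pd

# Hypothetical descriptions, for illustration only
blooms = pd.Series(['White blend', 'Outer petals white', 'Red'])

# Character class that accepts either capitalization of the first letter
mask_class = blooms.str.contains(r'[Ww]hite')

# Same idea, ignoring case for the whole pattern
mask_nocase = blooms.str.contains(r'white', case=False)

print(mask_class.tolist())
print(blooms[mask_nocase].tolist())
```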


### Extracting data

- Suppose we want to extract the information about the number of petals in the `BLOOM` column

- We see that this information comes in a variety of forms:
    1. `XX to XX petals`
    2. `XX-XX petals`
    3. `XX petals`

- How can we extract this information?

- We can use the `.str.findall()` Series method to find *all* occurrences of a pattern or regular expression
    - `Series.str.findall(PATTERN)` returns a Series of lists of strings with all matches of `PATTERN`
    - [Documentation for `.str.findall()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.findall.html)

- There are other Series methods that do similar things:
    - `.str.extract()` [[Documentation]](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.extract.html)
    - `.str.extractall()` [[Documentation]](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.extractall.html)
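
To get a feel for how these three methods differ, the sketch below runs them on a small made-up Series (not the roses data). Roughly: `.str.findall()` gives one list per row, `.str.extract()` gives the first match per row as DataFrame columns, and `.str.extractall()` gives one row per match.

```python
import pandas as pd

# Hypothetical descriptions, for illustration only
s = pd.Series(['20 to 25 petals', '75 petals', 'no count'])

# One list per row; with two groups, each match is a tuple
found = s.str.findall(r'(\d{2}) to (\d{2}) petals')

# First match per row, one column per group; NaN where nothing matches
first = s.str.extract(r'(\d{2}) petals')

# One row per match, indexed by (original row, match number)
every = s.str.extractall(r'(\d{2})')

print(found.tolist())
print(first)
print(every)
```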

- Let's create a new DataFrame called `extract_df`, using regular expressions to create additional columns containing the number of petals, based on the 3 forms above 

In [36]:
# Solution
extract_df = (
    df
    .assign(
        PETALS1=lambda x: x['BLOOM'].str.findall(r'(\d{2}) to (\d{2}) petals'),
        PETALS2=lambda x: x['BLOOM'].str.findall(r'(\d{2})-(\d{2}) petals'),
        PETALS3=lambda x: x['BLOOM'].str.findall(r'(?<! to )(?<!-)\s?(\d{2}) petals'),
    )
)

extract_df

Unnamed: 0,NAME,BLOOM,PETALS1,PETALS2,PETALS3
0,Evert van Dijk,"Carmine-pink, salmon-pink streaks, stripes, flecks. Warm pink, clear carmine pink, rose pink shaded salmon. Mild fragrance. Large, very double, in small clusters, high-centered bloom form. Blooms in flushes throughout the season.",[],[],[]
1,Every Good Gift,"Red. Flowers velvety red. Moderate fragrance. Average diameter 4"". Medium-large, full (26-40 petals), borne mostly solitary bloom form. Blooms in flushes throughout the season.",[],"[(26, 40)]",[]
2,Evghenya,"Orange-pink. 75 petals. Large, very double bloom form. Blooms in flushes throughout the season.",[],[],[75]
3,Evita,"White or white blend. None to mild fragrance. 35 petals. Large, full (26-40 petals), high-centered bloom form. Blooms in flushes throughout the season.",[],"[(26, 40)]",[35]
4,Evrathin,"Light pink. [Deep pink.] Outer petals white. Expand rarely. Mild fragrance. 35 to 40 petals. Average diameter 2.5"". Medium, double (17-25 petals), full (26-40 petals), cluster-flowered, in small clusters bloom form. Prolific, once-blooming spring or summer. Glandular sepals, leafy sepals, long sepals buds.","[(35, 40)]","[(17, 25), (26, 40)]",[40]
5,Evita 2,"White, blush shading. Mild, wild rose fragrance. 20 to 25 petals. Average diameter 1.25"". Small, very double, cluster-flowered bloom form. Blooms in flushes throughout the season.","[(20, 25)]",[],[25]


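- Not too bad! Note, however, that `PETALS3` also picks up the upper value of the `XX to XX` ranges in rows 4 and 5 (`40` and `25`): the optional `\s?` lets the match start at the space before the number, where the lookbehind `(?<! to )` no longer rules the match out, so those values are counted twice in what follows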


- Suppose that we want to distill this information into a single number


- Perhaps we can compute the average of these values?


- Since we're dealing with Series of lists of tuples 😳, this is a bit complicated


- One way to do this would be to use the `.apply()` method

### The `.apply()` method

- The `.apply()` method lets you apply a function to every row or every column of a DataFrame

- Usage: `.apply(FUNCTION, axis=AXIS)`
    - If `AXIS='index'`, then `FUNCTION` is applied to each column (yes, this feels backwards!)
    - If `AXIS='columns'`, then `FUNCTION` is applied to each row (yes, this feels backwards!)
    - `FUNCTION` should take one argument, the row or column of the DataFrame, depending on the value of `AXIS`

- ⚠️ **Warning.** `.apply()` is very flexible, but it is also very slow!
    - You should think carefully about whether there is a built-in Pandas Series/DataFrame method you can use instead, like the ones we discussed in Lesson 15 for numerical computations and Lesson 20 for string operations
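
The `axis` argument is easiest to see on a tiny DataFrame. In the sketch below, `toy_df` and `total()` are made-up names for illustration: the same function receives one column at a time with `axis='index'` and one row at a time with `axis='columns'`.

```python
import pandas as pd

toy_df = pd.DataFrame({'A': [1, 2], 'B': [10, 20]})

def total(values):
    """Sum the values of whichever row or column is passed in."""
    return values.sum()

# axis='index': the function is called once per column
per_column = toy_df.apply(total, axis='index')

# axis='columns': the function is called once per row
per_row = toy_df.apply(total, axis='columns')

print(per_column)
print(per_row)
```

For a simple sum like this, the built-in `toy_df.sum()` and `toy_df.sum(axis='columns')` give the same results and are the better choice.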

- Let's write a function `average_petals()` that takes a row `r` from `extract_df` as input, and
    - Puts all the petal values from `PETALS1`, `PETALS2`, and `PETALS3` into a single list
    - Takes the average of the petal values in the newly created list

In [37]:
import numpy as np

def average_petals(r):
    all_values = []
    
    for t in r['PETALS1']:
        all_values.append(t[0])
        all_values.append(t[1])
        
    for t in r['PETALS2']:
        all_values.append(t[0])
        all_values.append(t[1])
        
    for i in r['PETALS3']:
        all_values.append(i)
        
    all_values = [int(x) for x in all_values]
    
    if len(all_values) > 0:
        return np.mean(all_values)
    else:
        return np.nan

- Let's test this out: 

In [38]:
# Solution
extract_df.apply(average_petals, axis='columns')

0          NaN
1    33.000000
2    75.000000
3    33.666667
4    31.857143
5    23.333333
dtype: float64

- We can add this column to `extract_df` and clean things up a little bit: 

In [39]:
# Solution
(
    extract_df
    .assign(
        AVG_PETALS=lambda x: x.apply(average_petals, axis='columns')
    )
    .drop(
        columns=['BLOOM', 'PETALS1', 'PETALS2', 'PETALS3']
    )
)

Unnamed: 0,NAME,AVG_PETALS
0,Evert van Dijk,
1,Every Good Gift,33.0
2,Evghenya,75.0
3,Evita,33.666667
4,Evrathin,31.857143
5,Evita 2,23.333333
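
In the spirit of the warning about `.apply()`, here is one possible sketch of a version that relies only on built-in Pandas methods. It makes two assumptions that are not part of the lesson: a single combined pattern (an optional first number followed by ` to ` or `-`, then a number and ` petals`), and a small made-up Series instead of the roses data. With this pattern, each number is captured once.

```python
import pandas as pd

# Hypothetical descriptions, for illustration only
s = pd.Series([
    '35 petals. Large, full (26-40 petals)',
    '20 to 25 petals',
    'no petal count given',
])

# Group 1 (optional): lower end of a range; group 2: single value or upper end
pattern = r'(?:(\d{2})(?: to |-))?(\d{2}) petals'

# One row per match, indexed by (original row, match number)
matches = s.str.extractall(pattern)
nums = matches.astype(float)

# Add up the values and count them per original row, skipping missing groups
totals = nums.sum(axis='columns').groupby(level=0).sum()
counts = nums.notna().sum(axis='columns').groupby(level=0).sum()

# Rows without any match come back as NaN after reindexing
avg_petals = (totals / counts).reindex(s.index)
print(avg_petals)
```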


<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Notes and sources

- Lesson and exercises inspired by 
    - Chapter 14 of [R for Data Science](https://r4ds.had.co.nz/)
    - [This article on re-thought.com](https://re-thought.com/python-regex-example-for-pattern-2-digits-to-2-digits-26-to-40/)