# Regular Expressions in Python

## Overview

This notebook introduces Regular Expressions (Regex), a powerful pattern-matching tool essential for text processing in NLP and data engineering pipelines. Regex provides a concise and flexible way to search, match, and manipulate text patterns, making it indispensable for data cleaning, preprocessing, feature extraction, and validation tasks. We'll learn the syntax, metacharacters, and practical applications of regex in machine learning and deep learning workflows, with a focus on real-world use cases in text preprocessing and data wrangling.

## Objectives

- Understand the importance of Regex in ML and DL pipelines.
- Learn common characters and metacharacters used in regex patterns
- Write regex patterns to match and replace text in strings
- Read regex patterns and tell what they do
- Translate natural language into regex patterns


## Outline

1. **Introduction to Regular Expressions** - What regex is and why it matters
2. **Core Applications** - Data scraping, wrangling, and validation
3. **Use Cases in ML/DL Pipelines**:
   - Data Cleaning & Preprocessing
   - Feature Extraction
   - Weak Supervision & Labeling
4. **The `regex` Library in Python** - Introduction to the regex module
5. **Character Classes** - Using `[ ]` to match character sets
6. **Metacharacters** - Special characters with specific meanings
7. **Quantifiers** - Matching multiple occurrences (`*`, `+`, `?`, `{}`)
8. **Anchors** - Matching positions (`^`, `$`, `\b`)
9. **Groups and Capturing** - Extracting specific parts of matches
10. **Practical Examples** - Real-world pattern matching scenarios
11. **Tools and Resources** - Regex editors and learning resources

# Regular Expressions

> A **regular expression** (shortened as regex or regexp), sometimes referred to as a rational expression, is a sequence of characters that specifies a match pattern in text. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation. -- [Wikipedia](https://en.wikipedia.org/wiki/Regular_expression)

Example 1:

![Regular expression](https://upload.wikimedia.org/wikipedia/commons/thumb/9/97/Antony08.gif/474px-Antony08.gif)

Blue highlights show the match results of the regular expression pattern: `/r[aeiou]+/g` (lowercase `r` followed by one or more lowercase vowels).
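As a rough Python equivalent of the demo above, here is a minimal sketch. The sample sentence is made up for illustration (it is not the text in the animation), and the standard-library `re` module is used so the snippet runs without extra installs. `re.findall` plays the role of the `/g` ("global") flag by returning every match:

```python
import re

# Hypothetical sample text, not the text shown in the animation.
sample = "regular reaction around"

# Lowercase "r" followed by one or more lowercase vowels.
matches = re.findall(r"r[aeiou]+", sample)
print(matches)  # ['re', 'rea', 'rou']
```

The final `r` in "regular" is not matched because no vowel follows it.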


**Try it:** In VS Code (other IDEs have similar shortcuts), press `Ctrl+H` to find and replace in the open tab or `Ctrl+Shift+F` to search the whole workspace, then enable the `.*` toggle ("Use Regular Expression").

Example 2:

In [1]:
text = """
Pushups 30 reps 3 sets
5 reps 2 sets Pullups
2 Sets 15 Reps One-leg Squats
4 sets 8 reps 22.5 lbs Dumbbell Rows
4 sets 8 reps 15.25kg Dumbbell Rows
"""

This `text` is an exercise log in which each line records an exercise name, the number of sets, and the number of repetitions (reps). The entries are unorganized and inconsistent: the position of the name, the capitalization, the weight units, and the order (reps then sets, or sets then reps) all vary from line to line.

## Core Applications

Regex is useful for much more than finding words. It is widely used for:

* **Data Scraping & Wrangling:** Extracting specific information from websites or cleaning up messy datasets.
* **Data Validation:** Ensuring user input (like emails or phone numbers) follows the correct format.
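As a minimal sketch of validation, the snippet below checks a phone number against an assumed format. The format (`ddd-ddd-dddd`) and the helper name `is_valid_phone` are hypothetical, chosen only for illustration. It uses the standard-library `re`, and `fullmatch` requires the whole string to fit the pattern:

```python
import re

# Hypothetical format for illustration: 3 digits, dash, 3 digits, dash, 4 digits.
PHONE_RE = re.compile(r"\d{3}-\d{3}-\d{4}")

def is_valid_phone(value):
    """Return True only if the whole string fits the assumed format."""
    return PHONE_RE.fullmatch(value) is not None

print(is_valid_phone("555-123-4567"))       # True
print(is_valid_phone("call 555-123-4567"))  # False: extra text around the number
```

For validation, `fullmatch` is usually the safer choice than `search`, which would accept any string that merely contains a valid number.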

### Use Cases for Regex in ML/DL

Regular Expressions (Regex) are an indispensable tool in the Machine Learning (ML) and Deep Learning (DL) pipeline, particularly within **Natural Language Processing (NLP)** and **Data Engineering**.

While deep learning models are great at learning patterns, they are computationally expensive and require clean data. Regex serves as a lightweight, rule-based filter to clean, structure, and extract data *before* it ever touches a neural network.

Here are the primary use cases for Regex in an ML/DL workflow, organized by pipeline stage.


---


#### 1. Data Cleaning & Preprocessing (NLP)

This is the most common use case. Raw text data (from web scrapes, social media, or documents) is notoriously "noisy." You must sanitize it to prevent your model from learning irrelevant patterns.

**Removing Noise:** Stripping out HTML tags, URLs, hashtags, or non-ASCII characters that might confuse a model.

* *Example:* Removing URLs to reduce vocabulary size.
* *Pattern:* `https?://\S+|www\.\S+`


**PII Redaction (Privacy):** Anonymizing data by detecting and masking Personally Identifiable Information (emails, phone numbers, SSNs) before training to ensure privacy compliance.

* *Example:* Replacing emails with a `<EMAIL>` token.
* *Pattern:* `\b[\w\.-]+@[\w\.-]+\.\w{2,4}\b`


**Contraction Expansion:** Converting "don't" to "do not" or "I'm" to "I am" to standardize input for tokenizers.
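The three cleaning steps above could be combined roughly as follows. This is a sketch under stated assumptions: it uses the standard-library `re` and its `re.sub` replacement function, the `clean` helper and the two-entry `CONTRACTIONS` mapping are hypothetical, and a real pipeline would need a much fuller contraction list.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w\.-]+@[\w\.-]+\.\w{2,4}\b")

# Hypothetical, deliberately tiny mapping (keys are lowercase).
CONTRACTIONS = {"don't": "do not", "i'm": "I am"}
CONTRACTION_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)

def clean(text):
    text = URL_RE.sub("", text)           # drop URLs
    text = EMAIL_RE.sub("<EMAIL>", text)  # mask emails with a token
    # Look up each matched contraction (case-insensitively) in the mapping.
    text = CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace

raw = ("Contact me at jane.doe@example.com or visit "
       "https://example.com/page now. I'm sure you don't mind.")
print(clean(raw))
# Contact me at <EMAIL> or visit now. I am sure you do not mind.
```

Note that the mapping does not preserve the original capitalization of a contraction; that is a simplification of this sketch.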

#### 2. Feature Extraction (Structured Data from Unstructured Text)

Before feeding text into a model (like BERT or a Logistic Regressor), you often need to extract explicit features to feed into a separate dense layer or to use for stratification.

**Extracting Metadata:** Pulling dates, prices, invoice numbers, or ID codes from unstructured strings to create new tabular features.

* *Example:* Extracting dates from "Meeting on 2023-05-12".
* *Pattern:* `\d{4}-\d{2}-\d{2}`

**Specific Entity Identification:** Detecting specific codes (like stock tickers `\$[A-Z]+` or error codes `Error \d{3}`) to create binary features (e.g., `has_error_code`).
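A minimal sketch of turning the patterns above into features. The `extract_features` helper and the sample sentence are hypothetical, and the standard-library `re` is used:

```python
import re

DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")
TICKER_RE = re.compile(r"\$[A-Z]+")
ERROR_RE = re.compile(r"Error \d{3}")

def extract_features(text):
    """Build a small feature dict from one raw string."""
    return {
        "dates": DATE_RE.findall(text),
        "tickers": TICKER_RE.findall(text),
        "has_error_code": ERROR_RE.search(text) is not None,
    }

row = extract_features("Meeting on 2023-05-12 about $AAPL; server returned Error 503")
print(row)
# {'dates': ['2023-05-12'], 'tickers': ['$AAPL'], 'has_error_code': True}
```

Each dict could then become one row of a tabular feature set next to the raw text.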


#### 3. Weak Supervision & Labeling (Snorkel/Heuristics)

In scenarios where you lack labeled data, you can use Regex to create "noisy" labels (Weak Supervision) to bootstrap a dataset.

**Rule-Based Labeling Functions:** If you are building a spam classifier, you might write a regex to label any text containing "BUY NOW" or "CLICK HERE" as `SPAM`.

**Filtering Positive/Negative Samples:** Using regex to filter a massive unlabelled dataset to find samples that *likely* belong to a specific class to send to human annotators.
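A rule-based labeling function in this spirit might look like the sketch below. The label constants, the function name `lf_spam_keywords`, and the choice to ignore case are assumptions made for illustration; this is plain Python with the standard-library `re`, not the API of any labeling framework.

```python
import re

# Hypothetical label values: 1 for spam, -1 for "no opinion".
SPAM, ABSTAIN = 1, -1

SPAM_RE = re.compile(r"\b(?:BUY NOW|CLICK HERE)\b", re.IGNORECASE)

def lf_spam_keywords(text):
    """Vote SPAM if a trigger phrase appears, otherwise abstain."""
    return SPAM if SPAM_RE.search(text) else ABSTAIN

samples = ["Limited offer, buy now!", "See you at lunch"]
print([lf_spam_keywords(s) for s in samples])  # [1, -1]
```

The same compiled pattern can also serve as a filter, for example keeping only the matching samples to send to annotators.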


---


#### Summary Table: Regex in ML

| Stage | Goal | Example Use Case |
| --- | --- | --- |
| **Preprocessing** | Clean Text | Remove HTML tags (`<.*?>`) from web crawl data. |
| **Feature Eng.** | Create Features | Extract "1200 sqft" to create a numerical `area` feature. |
| **Labeling** | Weak Labels | Label text as "Urgent" if it matches `\b(ASAP |

## The `regex` library in Python

The Python docs for the built-in [`re`](https://docs.python.org/3/library/re.html) library state:

> See also The third-party [`regex` module](https://pypi.org/project/regex/), which has an API compatible with the standard library re module, but offers additional functionality and a more thorough Unicode support.

This section uses the `regex` library, whose API is compatible with `re`.
We install it and then import it under the alias `re`:

In [2]:
%pip install regex==2024.5.15 --quiet

In [3]:
# Standard library imports
# (none needed for this notebook)

# Third-party imports
import regex as re

A pattern is built from characters. Most characters match themselves literally, but some have a special meaning.

**Metacharacters**: characters that carry a special meaning in a pattern instead of matching themselves literally. They are:

```
. ^ $ * + ? { } [ ] \ | ( )
```

## Character Class

We will start by explaining the **Character Class** `[ ]` selector, which works as follows:

* `[abc]` matches a, b, or c.
* `[^abc]` matches any character *except* a, b, or c.
* `[a-c]` matches any character from a to c: `a`, `b`, or `c`.
* `[3-7]` matches any digit from 3 to 7: `3`, `4`, `5`, `6`, or `7`.
* `[a-zA-Z]` matches any lowercase or uppercase letter from a to z or A to Z.
* `[a-zA-Z0-9_]` matches any lowercase or uppercase letter, digit, or `_` (underscore).

From this, you know the meaning of:

* The `^` sign in contexts like `[^`: it indicates the **negation of the class**.
* The `-` sign in contexts like `[a-b]`: it indicates the **characters between the two characters (inclusive)**.
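The class forms listed above can be tried out with a short sketch. The sample strings are made up, and the standard-library `re` is used so the snippet runs as-is; the same calls work with `import regex as re`.

```python
import re

print(re.findall(r"[abc]", "a big cab"))    # ['a', 'b', 'c', 'a', 'b']
print(re.findall(r"[^abc ]", "a big cab"))  # ['i', 'g']  (negated class)
print(re.findall(r"[3-7]", "2468"))         # ['4', '6']  (range of digits)
print(re.findall(r"[a-zA-Z0-9_]+", "user_1 logged-in"))
# ['user_1', 'logged', 'in']
```

In the last call the hyphen in `logged-in` is outside the class, so it splits the text into two matches.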

Consider the following example:

* The pattern `[0-9][0-9]` matches any two consecutive digits.
* The function `re.search` (here `re` is the alias for `regex`) searches `text` for the first occurrence of `pattern`; it returns a `Match` object if the pattern is found and `None` otherwise.

To extract the matched text, call `match.group()`:


In [4]:
text = "I am 21 years old"
pattern = "[0-9][0-9]"
match = re.search(pattern, text)
print(match)

<regex.Match object; span=(5, 7), match='21'>


In [5]:
if match:
    print(match.group())

21


## Classes with Symbols

In addition to bracketed classes, there are shorthand classes written with a backslash. The uppercase form of each is its negation:

* **`\d`** (negation **`\D`**): matches any decimal digit; for ASCII text this is equivalent to `[0-9]`.

* **`\w`** (negation **`\W`**): matches word characters. For ASCII text this is `[a-zA-Z0-9_]`; for Unicode (`str`) patterns it matches alphanumeric characters in Unicode (as defined by [`str.isalnum()`](https://docs.python.org/3/library/stdtypes.html#str.isalnum)), plus the underscore `_`.

* **`\s`** (negation **`\S`**): matches whitespace; for ASCII text this is equivalent to `[ \t\n\r\f\v]` (note the inclusion of the space).

* **`\b`** (negation **`\B`**): matches a word boundary, that is, the empty position between a `\w` character and a `\W` character (or the start or end of the string). It consumes no characters.
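A quick sketch of the shorthand classes in action. The sample strings are made up, and the standard-library `re` is used so the snippet runs without extra installs:

```python
import re

text = "Room 42, floor 3!"

print(re.findall(r"\d", text))    # ['4', '2', '3']
print(re.findall(r"\w+", text))   # ['Room', '42', 'floor', '3']
print(re.findall(r"\W", text))    # [' ', ',', ' ', ' ', '!']
print(re.findall(r"\S+", text))   # ['Room', '42,', 'floor', '3!']
print(re.split(r"\s+", "a  b\tc\nd"))  # ['a', 'b', 'c', 'd']
```

The last line uses `re.split`, which cuts a string wherever the pattern matches; here it splits on any run of whitespace.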

Python string literals treat the backslash `\` as the start of an [escape sequence](https://docs.python.org/3/reference/lexical_analysis.html#escape-sequences), such as:

* `\n` for newline
* `\t` for tab
* `\r` for carriage return
* `\f` for form feed
* `\v` for vertical tab

These escapes can clash with the regex shorthands above: for example, `"\b"` in a normal string is a backspace character, not a word boundary. To avoid the conflict, prefix the string literal with `r` to make it a **raw string**, in which `\` has no special meaning to Python, as shown below.
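To make the clash concrete, here is a minimal sketch using the standard-library `re`. It compares a normal string with a raw string for the same pattern text:

```python
import re

# In a normal string "\b" is ONE character (backspace);
# in a raw string it is TWO characters (backslash + b).
print(len("\b"), len(r"\b"))  # 1 2

# Without the r prefix the pattern looks for literal backspace characters.
print(re.search("\bear\b", "My ear"))            # None

# With the r prefix the regex engine sees \b and treats it as a word boundary.
print(re.search(r"\bear\b", "My ear").group())   # ear
```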


In [7]:
animals = [
    "cat",
    "bat",
    "dog",
    "rat",
]

pattern = r"\wa\w"

for animal in animals:
    match = re.search(pattern, animal)
    if match:
        print(match.group())

cat
bat
rat



### Numbers

::: {.callout-tip}
Always use **Raw Strings** (with the `r` prefix) when writing patterns.
:::


In [8]:
text = "I am 21 years old"
pattern = r"\d\d"
match = re.search(pattern, text)
if match:
    print(match.group())

21


### Letters

To match letters, we can use the shorthand `\w` (which also accepts digits and the underscore).
The pattern `\wa\w` from the animals example above reads: **any two word characters with the letter `a` in between**, such as `cat`, `bat`, or `rat`. The word `dog` does not match.

The pattern `\wa\w` consists of three parts:

1. `\w` matches alphanumeric characters and the underscore.
2. `a` matches the character `a` as is.
3. `\w` matches alphanumeric characters and the underscore.



### Word Boundary `\b`

The symbol `\b` matches a word boundary (a position, not a character). Use it when you want to match a whole word rather than part of a longer word.

In [16]:
text1 = "My ear"
text2 = "I arrived early"
text3 = "This is the end of the year"

Note in the following example that the pattern only matches the first text, because it requires the word to have a boundary at both the beginning and the end:

In [17]:
entire = r"\b" + "ear" + r"\b"
print(re.search(entire, text1))
print(re.search(entire, text2))
print(re.search(entire, text3))

<regex.Match object; span=(3, 6), match='ear'>
None
None


Note in the following example that it matches the first and second, because the requirement is a boundary at the beginning only:

In [18]:
starts = r"\b" + "ear"
print(re.search(starts, text1))
print(re.search(starts, text2))
print(re.search(starts, text3))

<regex.Match object; span=(3, 6), match='ear'>
<regex.Match object; span=(10, 13), match='ear'>
None


And this last example matches the first and the last, because they have a boundary at the end:

In [19]:
ends   = "ear" + r"\b"
print(re.search(ends, text1))
print(re.search(ends, text2))
print(re.search(ends, text3))

<regex.Match object; span=(3, 6), match='ear'>
None
<regex.Match object; span=(24, 27), match='ear'>



## Repetition and Quantification

**Quantifiers** (repetition marks) specify how many times the preceding element may repeat. They are as follows:

* `{3}` exactly three times.
* `{2,4}` from 2 to 4 times.
* `{3,}` three times or more.
* `+` one or more times.
* `*` zero or more times.
* `?` zero or exactly one time.
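Each quantifier in the list above can be checked with a one-line sketch. The sample strings are made up, and the standard-library `re` is used; `fullmatch` succeeds only if the entire string fits the pattern.

```python
import re

print(bool(re.fullmatch(r"\d{3}", "123")))     # True   (exactly three)
print(bool(re.fullmatch(r"\d{3}", "1234")))    # False
print(re.findall(r"a{2,4}", "a aa aaaaa"))     # ['aa', 'aaaa']
print(re.findall(r"\d{3,}", "12 123 12345"))   # ['123', '12345']
print(re.findall(r"ab+", "a ab abbb"))         # ['ab', 'abbb']
print(re.findall(r"ab*", "a ab abbb"))         # ['a', 'ab', 'abbb']
print(re.findall(r"colou?r", "color colour"))  # ['color', 'colour']
```

Quantifiers are greedy by default, which is why `a{2,4}` takes four of the five `a` characters rather than stopping at two.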

To treat several characters as one unit within a pattern, wrap them in parentheses `( )`. This creates a **group** (see [Grouping](https://docs.python.org/3/howto/regex.html#grouping)), and a quantifier placed after the closing parenthesis applies to the whole group.

For example, if you want to match the price in the following texts:

In [30]:
prices = [
    "it costs 123",
    "I bought it for 12.3 last time",
    "I paid 12.34 SAR for it"
]

You would use the following pattern:

In [31]:
pattern = r"\d+(\.\d+)?"

It consists of two parts:

* `\d+` consists of two parts:
  * `\d` a digit
  * `+` one or more times
* `(...)?` what is between the parentheses: zero or exactly one time
  * `\.` we need the `\` sign to disable the special property of the dot character, otherwise it would match any character
  * `\d+` a digit, one or more times

In [32]:
for p in prices:
    match = re.search(pattern, p)
    if match:
        print('price:', match.group())

price: 123
price: 12.3
price: 12.34


## Matching

There are four basic functions for matching in Python:

* `re.match(pattern, string, flags=0) -> Match | None`
  * Matches the pattern at the *beginning* of the string.


* `re.search(pattern, string, flags=0) -> Match | None`
  * Searches for the *first* match of the pattern in the string.


* `re.findall(pattern, string, flags=0) -> list`
  * Creates a list of *all* matches for the pattern in the string.


* `re.finditer(pattern, string, flags=0) -> Iterator[Match[str]]`
  * Creates an iterator for *all* matches for the pattern in the string.

A successful match returns a [`Match`](https://docs.python.org/3/library/re.html#match-objects) object, whose most commonly used methods are:

* [`group()`](https://docs.python.org/3/library/re.html#re.Match.group): the matched text
* `start()`: the index where the match starts
* `end()`: the index just past the end of the match
* `span()`: the tuple `(start, end)`

The concept becomes clear with an example:

In [35]:
text = "Pushups 20 reps 4 sets"

pattern = r"\d+"

First, `re.match()` only matches at the beginning of the text. Because this text starts with a word rather than a digit, matching against the whole text fails, even though the same pattern matches the number on its own.
Matching on the full text:

In [36]:
print(re.match(pattern, text))

None


Matching on the number alone (successful):

In [37]:
print(re.match(pattern, "20"))

<regex.Match object; span=(0, 2), match='20'>


Therefore, we use the `.search()` function to look for the pattern in any position of the text:



In [38]:
print(re.search(pattern, text))

<regex.Match object; span=(8, 10), match='20'>


We can access the match attributes:

In [39]:

m = re.search(pattern, text)
print(m.group())
print(m.start(), m.end())
print(m.span())

20
8 10
(8, 10)



However, `re.search()` returned only the first number, while we want both `20` and `4`. For that, we use the `re.findall()` function:

In [40]:
ol = re.findall(pattern, text)
print(ol)


['20', '4']


If you want the `Match` objects (not just the matched text), use the `re.finditer()` function:

In [41]:
it = re.finditer(pattern, text)
for m in it:
    print(m)

<regex.Match object; span=(8, 10), match='20'>
<regex.Match object; span=(16, 17), match='4'>
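The four functions above only locate text. For replacing it, both `re` and `regex` provide `re.sub(pattern, repl, string)`. The sketch below uses the standard-library `re`; the output format `4x20` is just an illustrative choice. In the second call, `\1` and `\2` in the replacement refer to the text captured by the first and second pair of parentheses.

```python
import re

text = "Pushups 20 reps 4 sets"

# Replace every run of digits with a placeholder.
masked = re.sub(r"\d+", "N", text)
print(masked)   # Pushups N reps N sets

# Reuse captured text in the replacement: \2 = sets, \1 = reps.
swapped = re.sub(r"(\d+) reps (\d+) sets", r"\2x\1", text)
print(swapped)  # Pushups 4x20
```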



## Adjusting the Matching Process

[Flags](https://docs.python.org/3/howto/regex.html#compilation-flags) are used to adjust matching in several ways:

| Flag | Meaning |
| --- | --- |
| [`ASCII`](https://docs.python.org/3/library/re.html#re.ASCII), [`A`](https://docs.python.org/3/library/re.html#re.A) | Makes symbols like `\w`, `\b`, `\s` and `\d` match only ASCII characters. |
| [`DOTALL`](https://docs.python.org/3/library/re.html#re.DOTALL), [`S`](https://docs.python.org/3/library/re.html#re.S) | Makes the symbol `.` match any character, including newlines. |
| [`IGNORECASE`](https://docs.python.org/3/library/re.html#re.IGNORECASE), [`I`](https://docs.python.org/3/library/re.html#re.I) | Makes matching case-insensitive. |
| [`LOCALE`](https://docs.python.org/3/library/re.html#re.LOCALE), [`L`](https://docs.python.org/3/library/re.html#re.L) | Makes matching take locale settings into account. |
| [`MULTILINE`](https://docs.python.org/3/library/re.html#re.MULTILINE), [`M`](https://docs.python.org/3/library/re.html#re.M) | Enables multi-line matching, affecting the `^` and `$` symbols. |
| [`VERBOSE`](https://docs.python.org/3/library/re.html#re.VERBOSE), [`X`](https://docs.python.org/3/library/re.html#re.X) (for "Extended") | Allows writing regular expressions that are structured more clearly and are easier to understand. |

We illustrate with the flag `re.IGNORECASE`, which is useful for matching words written in the Latin alphabet regardless of capitalization. Note the difference between the two runs:

In [44]:
text = "She is she."

for m in re.finditer(r"[a-z]+", text):
    print(m)

print()

for m in re.finditer(r"[a-z]+", text, re.IGNORECASE):
    print(m)

<regex.Match object; span=(1, 3), match='he'>
<regex.Match object; span=(4, 6), match='is'>
<regex.Match object; span=(7, 10), match='she'>

<regex.Match object; span=(0, 3), match='She'>
<regex.Match object; span=(4, 6), match='is'>
<regex.Match object; span=(7, 10), match='she'>




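To combine several flags, join them with the `|` operator. (`re.LOCALE` is avoided in this example because the standard `re` module accepts it only with bytes patterns.)

```python
re.search(pattern, text, re.IGNORECASE | re.MULTILINE)
```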
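The other flags in the table can be sketched in the same way. The sample `log` string is made up, the standard-library `re` is used, and `DATE_RE` is a name chosen for this example:

```python
import re

log = "Pushups 30 reps\nPullups 5 reps"

# MULTILINE: ^ matches at the start of every line, not only the string.
print(re.findall(r"^\w+", log))                # ['Pushups']
print(re.findall(r"^\w+", log, re.MULTILINE))  # ['Pushups', 'Pullups']

# DOTALL: "." also matches the newline between the two lines.
print(re.search(r"reps.Pullups", log) is None)                 # True
print(re.search(r"reps.Pullups", log, re.DOTALL) is not None)  # True

# VERBOSE: whitespace in the pattern is ignored and comments are allowed.
DATE_RE = re.compile(
    r"""
    (\d{4})   # year
    -
    (\d{2})   # month
    -
    (\d{2})   # day
    """,
    re.VERBOSE,
)
print(DATE_RE.search("This dates back to 1970-06-29").groups())
# ('1970', '06', '29')
```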



## Extracting Groups from Text

A date can be described with **quantifiers**: four digits, then a dash, then two digits, then a dash, then two digits.
Parentheses split the pattern into groups so that each part (year, month, day) can be read separately.

In [47]:
text = "This dates back to 1970-06-29"
pattern = r"(\d{4})-(\d{2})-(\d{2})"
match = re.search(pattern, text)

if match:
    print(match.group(0))
    print(match.group(1))
    print(match.group(2))
    print(match.group(3))

1970-06-29
1970
06
29


Note that group `0` is the *whole* match.
This is why `.group()` with no argument, as used earlier, returns the entire matched text: `0` is the default.
Groups `1`, `2`, and `3` correspond to the parenthesized parts of the pattern, numbered by the position of their opening parenthesis from left to right.

Here is another example with **nested parentheses**, which clarifies how the numbers passed to the `.group(n)` method are assigned:

In [48]:
text = "20 Reps 4 Sets"

m = re.search(r"((\d+) Reps) ((\d+) Sets)", text)
if m:
    print(m.group(1))
    print(m.group(2))
    print(m.group(3))
    print(m.group(4))

20 Reps
20
4 Sets
4


And so on.
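Two related behaviors are worth a short sketch (standard-library `re`, sample strings made up). `Match.groups()` returns all captured groups at once. When a pattern contains groups, `re.findall` returns the group contents rather than the whole match, which can be surprising. A non-capturing group `(?:...)` groups without capturing.

```python
import re

m = re.search(r"(\d{4})-(\d{2})-(\d{2})", "This dates back to 1970-06-29")
print(m.groups())  # ('1970', '06', '29')

# With several groups, findall returns one tuple per match.
print(re.findall(r"(\d{4})-(\d{2})-(\d{2})", "1970-06-29 and 2023-05-12"))
# [('1970', '06', '29'), ('2023', '05', '12')]

# With one group, findall returns only that group's text.
print(re.findall(r"\d+(\.\d+)?", "12.34 and 5"))    # ['.34', '']

# A non-capturing group keeps the whole match in the result.
print(re.findall(r"\d+(?:\.\d+)?", "12.34 and 5"))  # ['12.34', '5']
```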


## Naming Groups

Python's regex engine lets you name the groups in a pattern so they can be extracted by name instead of by number.
The name comes right after the opening parenthesis: `(?P<name>...)`, where `...` is the sub-pattern.
For example:


text = "Muhammad AlKhwarizmi, Polymath"

m = re.search(r"(?P<first_name>\w+) (?P<last_name>\w+), \w+", text)
if m:
    print(m.groupdict())
    print(m.group('first_name'))
    print(m.group('last_name'))

{'first_name': 'Muhammad', 'last_name': 'AlKhwarizmi'}
Muhammad
AlKhwarizmi


It works the same way in nested groups:


In [51]:
text = "20 Reps 4 Sets"

m = re.search(r"((?P<reps>\d+) Reps) ((?P<sets>\d+) Sets)", text)
if m:
    print(m.groupdict())
    print(m.group('reps'))
    print(m.group('sets'))

{'reps': '20', 'sets': '4'}
20
4



This makes extracting patterns from texts much easier.

::: {.callout-note}
The `(?P<name>...)` spelling originated in Python and is not supported by every regex engine; many other engines write named groups as `(?<name>...)` instead.
If a pattern with named groups fails in an online editor, check that the editor is set to the Python flavor.
:::
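Named groups make it possible to sketch a parser for the inconsistent exercise log from Example 2. This is one possible approach, not a definitive solution. It uses the standard-library `re`; the pattern names and the `parse_line` helper are hypothetical. It searches for reps, sets, and weight independently so their order does not matter, and it leaves the exercise name out for brevity.

```python
import re

log = """
Pushups 30 reps 3 sets
5 reps 2 sets Pullups
2 Sets 15 Reps One-leg Squats
4 sets 8 reps 22.5 lbs Dumbbell Rows
4 sets 8 reps 15.25kg Dumbbell Rows
"""

REPS_RE = re.compile(r"(?P<reps>\d+)\s*reps", re.IGNORECASE)
SETS_RE = re.compile(r"(?P<sets>\d+)\s*sets", re.IGNORECASE)
WEIGHT_RE = re.compile(r"(?P<weight>\d+(?:\.\d+)?)\s*(?P<unit>lbs|kg)", re.IGNORECASE)

def parse_line(line):
    """Extract reps, sets, and optional weight from one log line."""
    reps = REPS_RE.search(line)
    sets_ = SETS_RE.search(line)
    weight = WEIGHT_RE.search(line)
    return {
        "reps": int(reps.group("reps")) if reps else None,
        "sets": int(sets_.group("sets")) if sets_ else None,
        "weight": float(weight.group("weight")) if weight else None,
        "unit": weight.group("unit").lower() if weight else None,
    }

records = [parse_line(line) for line in log.splitlines() if line.strip()]
for record in records:
    print(record)
```

Searching for each field separately trades one long pattern for three short ones, which tends to be easier to read and to extend when a new format appears.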


## Compiling the Regular Expression Once

So far we have called module-level functions such as `re.match()` and `re.search()` directly. Each call takes the pattern string, compiles it into a pattern object, and then runs the match in the library's C implementation.
The library caches recently compiled patterns, so a pattern is not fully recompiled on every call, but each call still pays for a cache lookup, and the pattern string and flags must be repeated everywhere.
To compile a pattern once and reuse it explicitly, use the **compilation** function `re.compile()`, which returns a pattern object with the same methods (`search`, `finditer`, and so on):

In [53]:
text1 = "She is she."
text2 = "They are they."

patternc = re.compile(r"[a-z]+", re.IGNORECASE)

for m in patternc.finditer(text1):
    print(m)

print()

for m in patternc.finditer(text2):
    print(m)

<regex.Match object; span=(0, 3), match='She'>
<regex.Match object; span=(4, 6), match='is'>
<regex.Match object; span=(7, 10), match='she'>

<regex.Match object; span=(0, 4), match='They'>
<regex.Match object; span=(5, 8), match='are'>
<regex.Match object; span=(9, 13), match='they'>



For comparison, here is the same code **without pre-compilation**; the pattern string and flags are repeated in every call:


In [54]:
text1 = "She is she."
text2 = "They are they."

for m in re.finditer(r"[a-z]+", text1, re.IGNORECASE):
    print(m)

print()

for m in re.finditer(r"[a-z]+", text2, re.IGNORECASE):
    print(m)

<regex.Match object; span=(0, 3), match='She'>
<regex.Match object; span=(4, 6), match='is'>
<regex.Match object; span=(7, 10), match='she'>

<regex.Match object; span=(0, 4), match='They'>
<regex.Match object; span=(5, 8), match='are'>
<regex.Match object; span=(9, 13), match='they'>



## Editing Regular Expressions

We recommend developing patterns in a **regular expression editor** such as [regex101](https://regex101.com/); it is much easier than writing them without a tool.

* In the sidebar, choose the **Python** flavor.
* In the first field, write the regular expression.
* In the large box, paste the text you want to match.

Another editor is [regexr](https://regexr.com/); its sidebar has a **Community Patterns** section with an index of patterns shared by other programmers. A similar collection is available at [regexHQ](https://github.com/regexhq).

Iterate by adjusting the pattern and adding test texts until the pattern behaves as intended, then copy it into your program.

[pythex](https://pythex.org/) is another editor, aimed specifically at Python regular expressions.



## Other Resources for Learning Regular Expressions

* [Learn Regex The Easy Way](https://github.com/ziishaned/learn-regex)

Interactive lessons for learning regular expressions:

* [RegexLearn](https://regexlearn.com/)
* [RegexOne](https://regexone.com/)



## Key Takeaways

- **Regular Expressions (Regex)** are powerful pattern-matching tools essential for text processing in NLP and data engineering pipelines.

- Regex is indispensable in ML/DL workflows for:
  - **Data Cleaning & Preprocessing**: Removing noise (HTML tags, URLs, hashtags), PII redaction, contraction expansion
  - **Feature Extraction**: Extracting metadata (dates, prices, IDs) from unstructured text
  - **Weak Supervision & Labeling**: Creating rule-based labels for unlabeled data

- **Character Classes** `[ ]` allow matching sets of characters:
  - `[abc]` matches a, b, or c
  - `[^abc]` matches any character except a, b, or c
  - `[a-z]` matches any lowercase letter from a to z

- **Metacharacters** have special meanings: `. ^ $ * + ? { } [ ] \ | ( )`

- **Quantifiers** control how many times a pattern matches:
  - `*` (0 or more), `+` (1 or more), `?` (0 or 1), `{n}` (exactly n), `{n,m}` (between n and m)

- **Anchors** match positions: `^` (start), `$` (end), `\b` (word boundary)

- **Groups** `()` allow capturing and extracting specific parts of matches

- The **`regex` library** in Python provides enhanced Unicode support and additional functionality compared to the standard `re` module

- Use regex editing tools like [regex101.com](https://regex101.com/) or [regexr.com](https://regexr.com/) for developing and testing patterns

- For large-scale keyword replacement/extraction, consider **FlashText** as a faster alternative to regex

## Pro-tip: **FlashText**

[FlashText](https://pypi.org/project/flashtext/) can be used to replace keywords in sentences or extract keywords from sentences. It is based on the FlashText algorithm ([Paper: Replace or Retrieve Keywords In Documents at Scale](https://arxiv.org/abs/1711.00046)).