# 03. 🔍 Standard Library Reference: `re` (Regex) DEEP DIVE

In computing, a *regular expression*, also referred to as `regex` or `regexp`, provides a concise and flexible means of matching strings of text, such as particular characters, words, or patterns of characters. A regular expression is written in a formal language that can be interpreted by a regular expression processor.

Regex is a miniature, powerful programming language embedded within Python. Mastering it allows you to define and extract data based on patterns, not just literal strings.

**Key Topics Covered (Deep Dive):**
* **Architecture:** `re.compile()` for performance and module-level functions.
* **Syntax I:** Character Classes, Boundaries, and Anchors.
* **Syntax II:** Quantifiers, Greedy vs. Non-Greedy matching.
* **Advanced:** Capture Groups, Backreferences, and Lookarounds.
* **Applications:** `re.sub()` for Substitution and `re.split()` for tokenization.
---
- Very powerful and quite cryptic
- Fun once you understand them
- Regular expressions are a language unto themselves
- A language of 'marker characters' - Programming with characters
- It is kind of an 'old school' language - Compact
---

Before you can use `re`, you must import it:

In [2]:
import re
# 💡 Engineering Tip: Always use the raw string prefix 'r' (e.g., r"\d+").
# This prevents Python from interpreting escape sequences like \t (tab) or \b (backspace) before the regex engine sees the pattern.

## 1.1 🛠️ Core Functions & Compilation

    
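Pattern strings are compiled into an internal pattern object before matching. The module-level functions cache recently compiled patterns, so calling `re.compile()` yourself mainly saves the per-call cache lookup and gives you a reusable, named pattern object; the gain is most noticeable for patterns used inside tight loops.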

    
| Function | Behavior | Output | Performance Note |
| :--- | :--- | :--- | :--- |
| **`re.compile()`** | Converts the pattern string into a reusable pattern object. | Pattern Object | Recommended for patterns reused in loops. |
| **`re.search()`** | Scans the *entire* string for the **first match**. | Match Object (or None) | Stops at the first match. |
| **`re.match()`** | Finds a match *only* at the **start** of the string (index 0). | Match Object (or None) | Avoid a leading `.*`; use `re.search()` to match anywhere. |
| **`re.findall()`** | Finds *all* non-overlapping matches. | List of Strings/Tuples | Returns tuples when the pattern has more than one group. |
| **`re.sub(p, r, s)`** | Substitutes all matches of pattern `p` with replacement `r` in string `s`. | New String | Replacement can use backreferences. |
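
The two call styles in the table can be compared side by side. The following is a minimal sketch; the pattern `word_num` and the sample strings are illustrative assumptions, not taken from the text above. It only shows that a compiled pattern object exposes the same operations as the module-level functions.

```python
import re

# Illustrative pattern and inputs (assumptions for this sketch).
word_num = re.compile(r'[a-z]+\d+')
samples = ["abc123 and xyz9", "no match here"]

for s in samples:
    # The compiled object's methods mirror the module-level functions.
    via_object = word_num.findall(s)
    via_module = re.findall(r'[a-z]+\d+', s)
    assert via_object == via_module
    print(s, "->", via_object)

# The pattern object remembers its source string.
print(word_num.pattern)
```

Either style gives the same results; the compiled form simply gives the pattern a name and a single place to change it.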
    

In [2]:
pattern_email = re.compile(r'\w+@\w+\.\w+')
text = "Contact support@example.com for help."

# Example 1: re.search (Finding the first instance)
match = pattern_email.search(text)
print(f"Search Match: {match.group(0) if match else 'None'}")

# Example 2: re.match (Fails because pattern is not at index 0)
match_start = pattern_email.match(text)
print(f"Match at Start: {match_start.group(0) if match_start else 'None'}")

# Example 3: re.findall
text_many = "Emails: a@b.com, c@d.org"
print(f"Find All: {pattern_email.findall(text_many)}")

Search Match: support@example.com
Match at Start: None
Find All: ['a@b.com', 'c@d.org']


## 1.2 🎯 Syntax I: Character Classes and Anchors
    
| Pattern | Matches | Example Match |
| :--- | :--- | :--- |
| **`^`** | **Start** of string. | `^Start` matches 'Start line' |
| **`$`** | **End** of string. | `End$` matches 'The End' |
| **`.`** | Any character (except newline, by default). | `a.c` matches 'abc' and 'a-c' |
| **`\d`** | Any digit (0-9). | `\d+` matches '123' |
| **`\w`** | Word character (a-z, A-Z, 0-9, _). | `\w+` matches 'user_1' |
| **`\s`** | Any whitespace (space, tab, newline). | `\s+` matches '  \t' |
| **`\S`** | Any non-whitespace character. | `\S+` matches 'a@b' |
| **`*`** | Repeats the preceding item zero or more times. | `ab*` matches 'a', 'ab', 'abb' |
| **`*?`** | Repeats the preceding item zero or more times (non-greedy). | `<.*?>` matches `<b>` in `<b>x</b>` |
| **`+`** | Repeats the preceding item **one or more times**. | `ab+` matches 'ab', 'abb' |
| **`+?`** | Repeats the preceding item **one or more times** (non-greedy). | `a+?` matches a single 'a' in 'aaa' |
| **`[a-zA-Z]`** | Character Set/Range. | Matches any single letter |
| **`[aeiou]`** | Matches a single character **in the listed set**. | Matches 'e' in 'pet' |
| **`[^aeiou]`** | Matches a single character **not in the listed set**. | Matches 'p' in 'pet' |
| **`[0-9]`** | Digit Set/Range. | Matches any single digit |
| **`[0-9]*`** | Digit Set/Range, repeated. | Matches zero or more digits |
| **`[0-9]+`** | Digit Set/Range, repeated. | Matches one or more digits |
| **`[^0-9]`** | Negated Set (matches anything NOT 0-9). | Matches 'x' in '4x' |
| **`\|`** | OR operator (Alternative). | `(cat\|dog)` matches 'cat' or 'dog' |
| **`\b`** | Word Boundary (The edge of a word). | `\bcat\b` matches 'cat' in 'The cat.' |
| **`\B`** | Non-Boundary (Inside a word). | `\Bcat\B` matches 'cat' in 'concatenate' |
| **`()`** | Indicates where string **extraction** is to *start/end*. | The whole pattern must match, but only the part inside `()` is returned. |
    
These are the building blocks of any regular expression. Use the raw string prefix `r` to ensure correct interpretation of backslashes.

In [None]:
text = "The cost is $10.99. Version is alpha_1.0. User ID: 902D08."
text_multiline = "From: Shashika\nDate: 2025-11-20\nTo: System"

# Labels are raw strings so that \d, \b, etc. print literally
# (in a normal string, \b would be a backspace character).

# --- 1. Basic Classes (Meta-Characters) ---
print("\n--- Classes ---")
print(r"Digits (\d+):    ", re.findall(r'\d+', text))  # Finds all runs of contiguous digits
print(r"Word Chars (\w+):", re.findall(r'\w+', text))  # Finds all words (letters, numbers, underscore)
print(r"Whitespace (\s+):", re.findall(r'\s+', text))  # Finds runs of spaces/tabs
print(r"Any Char (.):    ", re.findall(r'\d.\d', text))  # Finds '0.9', '1.0', and '902'

# --- 2. Character Sets ([] and Negation) ---
print("\n--- Character Sets ---")
print(r"[A-Z]:", re.findall(r'[A-Z]', text))  # Find uppercase letters
print(r"[^\s]:", re.findall(r'\S+', text))  # Find sequences of non-whitespace characters (\S is the same as [^\s])
print(r"[a-z0-9_]+:", re.findall(r'[a-z0-9_]+', text))  # Find lowercase words/IDs

# --- 3. Anchors and Boundaries ---
print("\n--- Anchors/Boundaries ---")
print(r"Start (^):", re.findall(r'^From', text_multiline, re.M))  # re.M lets ^ match at the start of every line
print(r"Word Boundary (\b):", re.findall(r'\buser\b', text, re.I))  # Find 'user' only as a whole word
print(r"Non-Boundary (\B):", re.findall(r'\Bate\B', 'concatenate'))  # Find 'ate' not at a word boundary
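
The alternation and negated-set rows of the table are not exercised by the cell above. The sketch below fills that gap under a few assumptions: the sample sentences are invented for illustration, and `(?:...)` is a non-capturing group, which this section has not introduced, used here only so that `findall` returns the whole match.

```python
import re

sentence = "The cat chased a dog past the catalog."  # illustrative input

# Alternation wrapped in word boundaries: whole words only.
pets = re.findall(r'\b(?:cat|dog)\b', sentence)
print(pets)    # ['cat', 'dog']

# Without boundaries, the 'cat' inside 'catalog' also matches.
loose = re.findall(r'cat|dog', sentence)
print(loose)   # ['cat', 'dog', 'cat']

# Negated set: runs of characters that are neither vowels nor whitespace.
consonant_runs = re.findall(r'[^aeiou\s]+', "regex rules")
print(consonant_runs)   # ['r', 'g', 'x', 'r', 'l', 's']
```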

## 1.3 🧪 Advanced I: Repetition and Groups

    
### Quantifiers (Controlling Repetition)
| Quantifier | Meaning | Example | Result |
| :--- | :--- | :--- | :--- |
| **`{m,n}`** | Between m and n times. | `\d{3,5}` | Matches '123', '4567'. |
| **`+`** | 1 or more times (equivalent to `{1,}`). | `A+` | Matches 'A', 'AA', 'AAA'. |
| **`*`** | 0 or more times (equivalent to `{0,}`). | `A*` | Matches '', 'A', 'AA'. |
| **`?`** | 0 or 1 time (Optional). | `home-?brew` | Matches 'homebrew' or 'home-brew'. |
| **`+?`** | 1 or more times, as few as possible (non-greedy). | `A+?` | In 'AAA', matches a single 'A' at a time. |
| **`*?`** | 0 or more times, as few as possible (non-greedy). | `<.*?>` | In `<b>x</b>`, matches `<b>` and `</b>` separately. |
    
### Greedy vs. Non-Greedy Quantifiers
Quantifiers are **Greedy** by default. To make them **Non-Greedy** (match the shortest possible string), append **`?`**.
    

In [None]:
text = "<b>First</b> and <b>Second</b>"

# Greedy: Matches from the first < to the last >
# Output: ['<b>First</b> and <b>Second</b>']
print(f"Greedy (*):     {re.findall(r'<.*>', text)}")

# Non-Greedy: Matches only until the first closing delimiter
# Output: ['<b>', '</b>', '<b>', '</b>']
print(f"Non-Greedy (*?): {re.findall(r'<.*?>', text)}")

Greedy (*):     ['<b>First</b> and <b>Second</b>']
Non-Greedy (*?): ['<b>', '</b>', '<b>', '</b>']
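
The remaining rows of the quantifier table can be checked the same way. This is a small sketch with invented sample strings; the `\b` anchors around `\d{3,5}` are an addition so that longer digit runs are rejected rather than partially matched.

```python
import re

# {m,n}: between 3 and 5 digits, as whole tokens (illustrative input).
codes = "12 123 4567 123456"
bounded = re.findall(r'\b\d{3,5}\b', codes)
print(bounded)    # ['123', '4567']

# ?: the hyphen is optional.
optional = re.findall(r'home-?brew', "homebrew vs home-brew")
print(optional)   # ['homebrew', 'home-brew']

# +? versus +: shortest versus longest run at the same position.
lazy = re.search(r'A+?', "AAA").group()
greedy = re.search(r'A+', "AAA").group()
print(lazy, greedy)   # A AAA
```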


### Capture Groups (`()`) and Backreferences
Groups allow you to structure the match and extract specific portions. Backreferences (`\1`, `\2`, etc.) refer to captured groups within the pattern itself or the replacement string.

In [None]:
text = "The the cat sat sat down."

# Pattern to find repeated words (\1 refers back to the first group: (\w+))
# Matching is case-sensitive, so "The the" is not reported.
pattern_double = re.compile(r'\b(\w+)\s+\1\b')
print(f"Repeated Words: {pattern_double.findall(text)}")

# Substitution Example: Swapping order
# Replacement: \2 (John) followed by \1 (Smith)
name_line = "Smith, John"
swap_pattern = r'(\w+),\s*(\w+)'
# Computed outside the f-string: a backslash inside {} is a SyntaxError before Python 3.12.
swapped = re.sub(swap_pattern, r'\2 \1', name_line)
print(f"Swapped: {swapped}")

Repeated Words: ['sat']
Swapped: John Smith
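
When a pattern has more than one group, `findall` returns tuples, and groups can also be given names. The sketch below illustrates both; the input string `record` and the group names `last` and `first` are assumptions chosen for the example.

```python
import re

record = "Smith, John; Doe, Jane"  # illustrative input

# With two groups, findall returns a list of tuples, one item per group.
pairs = re.findall(r'(\w+),\s*(\w+)', record)
print(pairs)   # [('Smith', 'John'), ('Doe', 'Jane')]

# Named groups: (?P<name>...) in the pattern, \g<name> in the replacement.
named = re.compile(r'(?P<last>\w+),\s*(?P<first>\w+)')
m = named.search(record)
print(m.group('first'), m.group('last'))   # John Smith
print(m.groupdict())

swapped_all = named.sub(r'\g<first> \g<last>', record)
print(swapped_all)   # John Smith; Jane Doe
```

Named groups tend to pay off once a pattern has three or more groups, because the replacement string no longer depends on counting parentheses.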


---
### Using `re.search()`
- Scans the *entire* string for the **first match**.
- Output: Match Object (or None)

In [8]:
hand = open('mbox-short.txt')
# ---------- Using str -----------------
#for i in hand:
#    i = i.rstrip()
#    if i.find('From:') >= 0:
#        print(i)

# ---------- Using re ------------------
# Print every line that contains 'From:' anywhere
for line in hand:
    line = line.rstrip()
    if re.search('From:', line):
        print(line)
# This does not print a line such as:
# "From": Shashika: "What is the matter?"
# because the ':' does not directly follow 'From'

# Search for lines that start with 'From:'
hand.seek(0)  # rewind: the previous loop consumed the file
for line in hand:
    line = line.rstrip()
    if re.search('^From:', line):
        #print(line)
        pass

# Search for lines that start with 'F', followed by
# 2 characters (any = .), followed by 'm:'
hand.seek(0)  # rewind again before the third pass
for line in hand:
    line = line.rstrip()
    if re.search('^F..m:', line):
        #print(line)
        pass
hand.close()

From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: gsilver@umich.edu
From: gsilver@umich.edu
From: zqian@umich.edu
From: gsilver@umich.edu
From: wagnermr@iupui.edu
From: zqian@umich.edu
From: antranig@caret.cam.ac.uk
From: gopal.ramasammycook@gmail.com
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: louis@media.berkeley.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu


In [None]:
# Search for lines that start with From and have an at sign
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^From:.+@', line):
        print(line)


In [None]:
# Search for lines that start with 'X-' followed by any
# characters and ':',
# followed by a space and a number.
# The number can include a decimal point.
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search(r'^X-.*: [0-9.]+', line):
        print(line)
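
The cells above need `mbox-short.txt` on disk. To try the same pattern without the file, the following sketch feeds a few made-up header lines through `io.StringIO`; the sample lines are hypothetical stand-ins, not quotations from the real file.

```python
import io
import re

# Hypothetical stand-in for a few lines of mbox-short.txt.
sample = io.StringIO(
    "From: someone@example.edu\n"
    "X-DSPAM-Confidence: 0.8475\n"
    "X-DSPAM-Probability: 0.0000\n"
    "Subject: no number here\n"
)

header_number = re.compile(r'^X-.*: [0-9.]+')

# Keep only the lines where the pattern matches.
matched = [line.rstrip() for line in sample if header_number.search(line)]
print(matched)
```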

---
### Using `re.findall()`
- Finds *all* non-overlapping matches.
- Output: List of Strings/Tuples

In [None]:
import re
s = 'A message from csev@umich.edu to cwen@iupui.edu about meeting @2PM'
lst = re.findall(r'\S+@\S+', s)
print(lst)

# This ignores the part '@2PM' because
# there must be one or more non-whitespace characters before the @ sign

['csev@umich.edu', 'cwen@iupui.edu']
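
`findall` returns only the matched text. If the position of each match is also needed, `re.finditer` yields match objects instead. A minimal sketch using the same string:

```python
import re

s = 'A message from csev@umich.edu to cwen@iupui.edu about meeting @2PM'

found = []
for m in re.finditer(r'\S+@\S+', s):
    # m.span() gives the (start, end) indexes of the match within s.
    found.append((m.group(), m.span()))
    print(m.group(), m.span())
```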


In [None]:
# Search for lines that have an at sign between characters
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall(r'\S+@\S+', line)
    if len(x) > 0:
        print(x)

In [None]:
# Search for lines that have an at sign between characters
# The characters must be a letter or number
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall(r'[a-zA-Z0-9]\S*@\S*[a-zA-Z]', line)
    if len(x) > 0:
        print(x)


In [14]:
# Search for lines that have an at sign between characters
# The characters must be a letter or number
# The results will be slightly more accurate than re07.py for email addresses
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall(r'[a-zA-Z0-9\-.]\S+@[a-zA-Z0-9].\S+[a-zA-Z]', line)
    if len(x) > 0:
        print(x)

['stephen.marquard@uct.ac.za']
['postmaster@collab.sakaiproject.org']
['200801051412.m05ECIaH010327@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
['source@collab.sakaiproject.org']
['stephen.marquard@uct.ac.za']
['source@collab.sakaiproject.org']
['stephen.marquard@uct.ac.za']
['stephen.marquard@uct.ac.za']
['louis@media.berkeley.edu']
['postmaster@collab.sakaiproject.org']
['200801042308.m04N8v6O008125@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
['source@collab.sakaiproject.org']
['louis@media.berkeley.edu']
['source@collab.sakaiproject.org']
['louis@media.berkeley.edu']
['louis@media.berkeley.edu']
['zqian@umich.edu']
['postmaster@collab.sakaiproject.org']
['200801042109.m04L92hb007923@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject

In [None]:
# Search for lines that start 'X' followed by any non whitespace
# characters and ':' then output the first group of non whitespace
# characters that follows
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall(r'^X\S*: (\S+)', line)
    if not x: continue
    print(x)

In [17]:
# Search for lines that start with 'X-' followed by any
# characters and ':' followed by a space
# and a number. The number can include a decimal point.
# Then print the captured number for every matching line.
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall(r'^X-.*: ([0-9.]+)', line)
    if len(x) > 0:
        print(x)

['0.8475']
['0.0000']
['0.6178']
['0.0000']
['0.6961']
['0.0000']
['0.7565']
['0.0000']
['0.7626']
['0.0000']
['0.7556']
['0.0000']
['0.7002']
['0.0000']
['0.7615']
['0.0000']
['0.7601']
['0.0000']
['0.7605']
['0.0000']
['0.6959']
['0.0000']
['0.7606']
['0.0000']
['0.7559']
['0.0000']
['0.7605']
['0.0000']
['0.6932']
['0.0000']
['0.7558']
['0.0000']
['0.6526']
['0.0000']
['0.6948']
['0.0000']
['0.6528']
['0.0000']
['0.7002']
['0.0000']
['0.7554']
['0.0000']
['0.6956']
['0.0000']
['0.6959']
['0.0000']
['0.7556']
['0.0000']
['0.9846']
['0.0000']
['0.8509']
['0.0000']
['0.9907']
['0.0000']


In [18]:
# Search for lines that start with 'Details:' and end with
# 'rev=' followed by digits
# Then print the number if one is found
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall(r'^Details:.*rev=([0-9]+)$', line)
    if len(x) > 0:
        print(x)

['39772']
['39771']
['39770']
['39769']
['39766']
['39765']
['39764']
['39763']
['39762']
['39761']
['39760']
['39759']
['39758']
['39757']
['39756']
['39755']
['39754']
['39753']
['39752']
['39751']
['39750']
['39749']
['39746']
['39745']
['39744']
['39743']
['39742']


In [None]:
# Search for lines that start with From and a character
# followed by a two digit number between 00 and 99 followed by ':'
# Then print the number if one is found
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall('^From .* ([0-9][0-9]):', line)
    if len(x) > 0: print(x)

In [19]:
# Search for lines that contain 'Author:' followed by any characters,
# an at sign, and any non whitespace character
# Then print the character group that follows the at sign
import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall(r'Author:.*@(\S+)', line)
    if not x: continue
    print(x)

['uct.ac.za']
['media.berkeley.edu']
['umich.edu']
['iupui.edu']
['umich.edu']
['iupui.edu']
['iupui.edu']
['iupui.edu']
['umich.edu']
['umich.edu']
['umich.edu']
['umich.edu']
['iupui.edu']
['umich.edu']
['caret.cam.ac.uk']
['gmail.com']
['uct.ac.za']
['uct.ac.za']
['uct.ac.za']
['uct.ac.za']
['uct.ac.za']
['media.berkeley.edu']
['media.berkeley.edu']
['media.berkeley.edu']
['iupui.edu']
['iupui.edu']
['iupui.edu']




In [21]:
# Search for lines that contain 'New Revision: ' followed by a number
# Then turn the number into a float and append it to nums
# Finally print the length and the average of nums
import re
fname = input('Enter file:')
hand = open(fname)
nums = list()
for line in hand:
    line = line.rstrip()
    x = re.findall('New Revision: ([0-9]+)', line)
    if len(x) == 1:
        val = float(x[0])
        nums.append(val)
print(len(nums))
print(int(sum(nums)/len(nums)))


27
39756


## 1.4 🤯 Advanced II: Lookarounds & Flags

    
### Lookarounds (Non-Capturing Assertions)
Lookarounds check for context without including that context in the final match. This is key for complex validation and exclusion logic (e.g., finding a word *not* followed by 's').

In [6]:
prices = "$100.00 EUR $200.00 USD"

# Positive Lookahead: (?=\s*USD) - Find amounts *followed by* ' USD'
pattern_usd = r'\$\d+\.\d{2}(?=\s*USD)'
print(f"Positive Lookahead (USD): {re.findall(pattern_usd, prices)}")

# Negative Lookahead: (?!\s*EUR) - Find amounts *not followed by* ' EUR'
pattern_not_eur = r'\$\d+\.\d{2}(?!\s*EUR)'
print(f"Negative Lookahead (NOT EUR): {re.findall(pattern_not_eur, prices)}")

Positive Lookahead (USD): ['$200.00']
Negative Lookahead (NOT EUR): ['$200.00']
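
Lookarounds also work in the other direction. The sketch below adds lookbehind, `(?<=...)` and `(?<!...)`, plus the "word not followed by 's'" case mentioned in the introduction to this subsection. The sample strings `words` and `"cost $15, qty 7"` are invented for illustration.

```python
import re

prices = "$100.00 EUR $200.00 USD"

# Positive lookbehind: digits preceded by '$'; the '$' is not part of the match.
amounts = re.findall(r'(?<=\$)\d+\.\d{2}', prices)
print(amounts)    # ['100.00', '200.00']

# Negative lookahead: 'cat' not followed by 's'.
words = "cat cats cat."
singular = re.findall(r'\bcat(?!s)', words)
print(singular)   # ['cat', 'cat']

# Negative lookbehind: numbers not preceded by '$' (or by another digit,
# which would otherwise let the tail of '$15' slip through).
plain = re.findall(r'(?<![$\d])\d+', "cost $15, qty 7")
print(plain)      # ['7']
```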


### Flags (Modifying Behavior)
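Flags change how a pattern is interpreted: `re.I` (IGNORECASE) ignores case, `re.M` (MULTILINE) lets `^` and `$` match at line boundaries, and `re.S` (DOTALL) lets `.` match newlines.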

In [7]:
text = "Hello World\nSecond Line"

# re.I (Case Insensitive)
print(f"Case Insensitive: {re.findall(r'hello', text, re.I)}")

# re.M (Multi-Line): Allows ^ and $ to match start/end of lines, not just the string
print(f"Multiline Start(^): {re.findall(r'^Second', text, re.M)}")

# re.S (DOTALL): Allows '.' to match *newlines* too
text_dots = "one\ntwo"
print(f"Dot Matches Newline: {re.findall(r'one.two', text_dots, re.S)}")

Case Insensitive: ['Hello']
Multiline Start(^): ['Second']
Dot Matches Newline: ['one\ntwo']
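
Two further points about flags may be useful: they can be combined with `|`, and `re.X` (VERBOSE), which is not covered above, allows whitespace and comments inside the pattern itself. A brief sketch with illustrative inputs:

```python
import re

# re.X (VERBOSE) ignores unescaped whitespace and allows comments in the pattern.
date_pattern = re.compile(r"""
    (\d{4})   # year
    -
    (\d{2})   # month
    -
    (\d{2})   # day
""", re.X)

dates = date_pattern.findall("Released 2025-11-20, patched 2025-12-01")
print(dates)   # [('2025', '11', '20'), ('2025', '12', '01')]

# Flags combine with |: case-insensitive and multi-line together.
sample = "Hello World\nsecond line"
both = re.findall(r'^SECOND', sample, re.I | re.M)
print(both)    # ['second']
```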


## 1.5 📝 Application: Splitting and Substitution

    
### Splitting with `re.split()`
More powerful than `str.split()` because the delimiter can be a pattern.

In [8]:
text = "Field1:Value1 Field2:Value2"

# Split on any run of non-word characters (\W+)
# Computed outside the f-string: a backslash inside {} is a SyntaxError before Python 3.12.
words = re.split(r'\W+', 'This is a test, short and sweet, of split().')
print(f"Split by non-word runs: {words}")

# Split by multiple potential delimiters (space or colon)
print(f"Split by [ :]: {re.split(r'[ :]', text)}")

Split by non-word runs: ['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', '']
Split by [ :]: ['Field1', 'Value1', 'Field2', 'Value2']
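
The substitution half of this section can be sketched in the same style. The `log` string and the helper `double_id` below are hypothetical, introduced only to show three things: backreferences in the replacement, a function as the replacement, and the `count` argument. The last line shows that a capture group in `re.split()` keeps the delimiters.

```python
import re

# Hypothetical input for illustration.
log = "user=alice id=42; user=bob id=7"

# Backreferences in the replacement: swap the two fields.
swapped = re.sub(r'user=(\w+) id=(\d+)', r'id=\2 user=\1', log)
print(swapped)   # id=42 user=alice; id=7 user=bob

# A function as the replacement receives each match object.
def double_id(match):
    return f"id={int(match.group(1)) * 2}"

doubled = re.sub(r'id=(\d+)', double_id, log)
print(doubled)   # user=alice id=84; user=bob id=14

# count limits the number of substitutions.
first_only = re.sub(r'\d+', '#', log, count=1)
print(first_only)   # user=alice id=#; user=bob id=7

# A capture group in re.split keeps the delimiters in the result.
parts = re.split(r'([;,])\s*', "a; b, c")
print(parts)     # ['a', ';', 'b', ',', 'c']
```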


---

    
## Mini-Challenge: The Complex Log Parser

    
**Log Line:** `INFO: Request from 10.0.0.1 failed (Timeout) after 34.5s processing.`

    
**Task:** Write a single pattern with **Capture Groups** to extract:
1.  The IP Address (`10.0.0.1`).
2.  The Error Reason (`Timeout`).
3.  The Processing Time (`34.5`).


In [9]:
log_line = "INFO: Request from 10.0.0.1 failed (Timeout) after 34.5s processing."
    
# One possible solution
pattern = r'Request from\s+(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}).*?\((\w+)\) after (\d+\.\d+)s'

match = re.search(pattern, log_line)
if match:
    print("--- Extracted Fields ---")
    print(f"IP Address:      {match.group(1)}")
    print(f"Error Reason:    {match.group(2)}")
    print(f"Processing Time: {match.group(3)}")

--- Extracted Fields ---
IP Address:      10.0.0.1
Error Reason:    Timeout
Processing Time: 34.5


---

    
## Mini-Challenge: The Log Line Parser

    
**Scenario:** Parse a complex log line to extract specific fields.

    
**Log Line:** `[2025-11-20 09:30:15] [ERROR] User(id=101) failed to connect from IP 192.168.1.1`

    
**Task:** Write a single pattern with **Capture Groups** to extract:
1.  The timestamp (`YYYY-MM-DD HH:MM:SS`).
2.  The error level (`ERROR`).
3.  The User ID (`101`).
    
    
*Hint: Use non-greedy quantifiers or defined character sets.*

In [None]:
log_line = "[2025-11-20 09:30:15] [ERROR] User(id=101) failed to connect from IP 192.168.1.1"
    
# Write your solution here


In [None]:
# Solution
pattern = r'\[(.*?)\] \[(\w+)\] User\(id=(\d+)\)'

# The groups are:
# 1: Timestamp
# 2: ERROR/LEVEL
# 3: User ID
match = re.search(pattern, log_line)
if match:
    print(f"Timestamp: {match.group(1)}")
    print(f"Level:     {match.group(2)}")
    print(f"UserID:    {match.group(3)}")

---

    
## 🌟 Core Insight for Your CSE Career

    
### Debugging Regex
When a complex Regex fails, do not guess.
1.  **Use Raw Strings (`r''`):** This eliminates the **Backslash Plague**, where Python interprets a sequence such as `\b` as a string escape (backspace) before the regex engine ever sees it.
    
2.  **Test Incrementally:** Start with anchors (`^...$`) and build the pattern one small piece at a time.
    
3.  **Use Online Tools:** Use a tool like Regex101 or RegExr to test your pattern visually before bringing it into Python.
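
The "test incrementally" step can be sketched in code. The list `stages` is an illustrative construction, not a prescribed workflow; it grows the Log Line Parser pattern one piece at a time and reports what each stage captures, so a failing piece is easy to spot.

```python
import re

log_line = "[2025-11-20 09:30:15] [ERROR] User(id=101) failed to connect from IP 192.168.1.1"

# Build the pattern one piece at a time and check each stage before extending it.
stages = [
    r'^\[(.*?)\]',                             # timestamp only
    r'^\[(.*?)\] \[(\w+)\]',                   # + level
    r'^\[(.*?)\] \[(\w+)\] User\(id=(\d+)\)',  # + user id
]

results = []
for pattern in stages:
    m = re.search(pattern, log_line)
    results.append(m.groups() if m else None)
    print(pattern, "->", m.groups() if m else "NO MATCH")
```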

### Performance & Compiling
In high-speed data pipelines (FastAPI/MLOps), every millisecond counts. If you are processing 100,000 log lines with the same pattern, compile it once outside the loop; this skips the pattern-cache lookup that the module-level functions perform on every call:

    
```python
import re

log = ["INFO ok", "ERROR disk full"]  # stand-in for real log lines

# SLOWER: looks the pattern up in re's internal cache on every iteration
for line in log:
    re.search(r'ERROR', line)

# FASTER: compile once and reuse the pattern object
COMPILED = re.compile(r'ERROR')
for line in log:
    COMPILED.search(line)