# Regex Essentials: Overview
- Regular expressions (regex) are a language for defining **text search patterns**.  
- Python’s `re` module provides functions like `search` (find anywhere) and `match` (anchored at start).  
- Patterns include literals, metacharacters (`. ^ $ * + ? [] \`), character classes (`\d`, `\w`, `\s`), and quantifiers (`*`, `+`, `?`, `{n,m}`).  
- Greedy quantifiers (`*`, `+`) match as much as possible; non-greedy (`*?`, `+?`) as little as possible.

| Character | Matches | Example | Example matches |
|-----------|---------|---------|-----------------|
| ^         | Start of the string | ^hello | line starting with hello |
| \$ | End of the string | hello\$ | line ending with hello |
| . | Any single character  | hel.o | hello, heloo, hel o |  
| ? | Preceding character 0 or 1 time | hel?o <br>- 'h', 'e' must be present <br>- followed by 0 or 1 'l' <br>- 'o' must be present | heo, helo |
| * | Preceding character 0 or more times | hel*o <br>- 'h', 'e' must be present <br>- followed by 0 or more 'l' <br> -'o' must be present | heo, helo, hello, helllo | 
| + | Preceding character 1 or more times | hel+o <br>- 'h', 'e' must be present <br>- followed by 1 or more 'l' <br>- 'o' must be present | hello, helllo, hellllo |
| [] | Any one of the characters enclosed in [] | he[lo] | hel, heo |
| [-] | Any character within the range | file[1-3] | file1, file2, file3 |
| [^] | Any one character except those enclosed in [] | file[^123] | file4, file5 |
| {n} | Preceding character must match exactly n times | he.{2}o <br>- . before {} means any char | hello, hekko, heppo |
| {n,} | Preceding character must match at least n times | he.{2,}o <br>- . before {} means any char | hello, hekkkko, hepppppo |
| {n,m} | Preceding character must match at least n times and at most m times | he.{2,3}o <br>- . before {} means any char | hello, hekkko, hepppo |
| \ | Escape character: removes the special meaning of the metacharacter that follows | hell\\.o <br>- . is a metacharacter; \\. matches a literal dot | hell.o |
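
The example column above can be checked mechanically. The sketch below uses `re.fullmatch`, which succeeds only when the whole string matches the pattern. The `samples` and `results` names are illustrative choices, not part of the `re` API.

```python
import re

# Patterns from the table, each paired with strings the table lists as matches.
samples = {
    r"hel.o": ["hello", "heloo", "hel o"],
    r"hel?o": ["heo", "helo"],
    r"hel*o": ["heo", "helo", "helllo"],
    r"hel+o": ["helo", "hello"],
    r"file[1-3]": ["file1", "file3"],
    r"file[^123]": ["file4", "file5"],
    r"he.{2,3}o": ["hello", "hekkko"],
    r"hell\.o": ["hell.o"],
}

# re.fullmatch requires the entire string to match the pattern.
results = {
    pattern: [bool(re.fullmatch(pattern, s)) for s in strings]
    for pattern, strings in samples.items()
}

for pattern, flags in results.items():
    print(pattern, flags)

# Counter-examples: strings that should not match.
print(bool(re.fullmatch(r"hel+o", "heo")))       # '+' needs at least one 'l'
print(bool(re.fullmatch(r"hell\.o", "hellxo")))  # escaped dot matches only '.'
```

Using `re.fullmatch` instead of `re.search` avoids false positives where a pattern matches only part of a longer string.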




## Introduction to `re.search()` vs `re.match()`
- `re.search(pattern, text)` scans the **entire string** for the **first occurrence**.  
- `re.match(pattern, text)` checks only at the **beginning of the string**.
- `re.findall()` and `re.finditer()` let you retrieve **every occurrence of a pattern**.   
- Always use **raw strings** (`r"..."`) for regex patterns so that Python's own string escapes do not interfere with the regex. **NOTE:** Raw strings make patterns easier to write, but regex escaping rules still apply. A raw string only removes the need to double each backslash: `"\\."` can be written as `r"\."`, and in both cases the regex engine receives `\.`, which matches a literal dot.
    

In [4]:
import re

line = "WARN: Disk usage at 91%"
pattern = r"Disk"

print(f"search '{pattern}':", bool(re.search(pattern, line)))  # scans the entire string
print(f"match '{pattern}':", bool(re.match(pattern, line)))    # checks only at the beginning of the string


search 'Disk': True
match 'Disk': False
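
The bullets above also mention `re.findall()`, `re.finditer()`, and raw strings, which the cell does not exercise. A minimal sketch follows; the `log` string and the variable names are made up for illustration.

```python
import re

log = "WARN: Disk usage at 91%, swap usage at 47%"

# findall returns every non-overlapping match as a list of strings.
percentages = re.findall(r"\d+%", log)
print(percentages)

# finditer yields match objects, which also carry the position of each match.
positions = [(m.group(), m.start()) for m in re.finditer(r"usage", log)]
print(positions)

# A raw string and a doubly escaped normal string describe the same pattern.
same_pattern = r"\d+%" == "\\d+%"
print(same_pattern)
```

`finditer` is the better fit when the location of each match matters; `findall` is enough when only the matched text is needed.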


## Common Metacharacters
- `.` matches any character (except newline).  
- `^` anchors at start of string.  
- `$` anchors at end of string.  
- `[]` defines a set or range of characters, e.g. `[A-Z]`.  
- `\` escapes metacharacters or introduces special sequences.

In [9]:
import re

test = "Error code: E1234. cxge"

# Start and end anchors
print("Starts with 'Error':", re.findall(r"^Error", test))
print("Starts with 'E1234':", re.findall(r"^E1234", test))  # 'E1234' is not at the start, so no match
print("Ends with 'cxge':", re.findall(r"cxge$", test))

# Any single character
print("Dot matches any single character:", re.findall(r"c..e", test))
print("Dot matches any single character:", re.findall(r"c.d", test))

# Quantifiers applied to the preceding 'r'
print("0 or 1 time:", re.findall(r"Er?or", test))  # 'Error' has two r's before 'o', so no match
print("0 or more times:", re.findall(r"Er*or", test))
print("1 or more times:", re.findall(r"Er+or", test))

# Character set: 'E' or any digit, one or more times
print("Character set:", re.findall(r"[E0-9]+", test))

# Matches n times
print("Matches 'r' exactly 2 times:", re.findall(r"r{2}", test))
print("Matches 'r' at least 1 and at most 2 times:", re.findall(r"r{1,2}", test))


Starts with 'Error': ['Error']
Starts with 'E1234': []
Ends with 'cxge': ['cxge']
Dot matches any single character: ['code', 'cxge']
Dot matches any single character: ['cod']
0 or 1 time: []
0 or more times: ['Error']
1 or more times: ['Error']
Character set: ['E', 'E1234']
Matches 'r' exactly 2 times: ['rr']
Matches 'r' at least 1 and at most 2 times: ['rr', 'r']
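
Two points from the bullet list are easy to get wrong in practice: an unescaped `.` matches more than a literal dot, and `.` stops at a newline by default. The sketch below illustrates both. The sample strings are made up, and `re.DOTALL` is a standard `re` flag that goes slightly beyond what this section covers.

```python
import re

version = "release 1.2.3 or 1x2y3"

# Unescaped dots match any character, so both candidates are found.
loose = re.findall(r"\d.\d.\d", version)
# Escaped dots match only a literal '.'.
strict = re.findall(r"\d\.\d\.\d", version)
print(loose)
print(strict)

# By default '.' does not match a newline; re.DOTALL lifts that restriction.
two_lines = "first\nsecond"
default_dot = re.findall(r"first.second", two_lines)
dotall_dot = re.findall(r"first.second", two_lines, flags=re.DOTALL)
print(default_dot)
print(dotall_dot)
```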


## Special Sequences
- `\d` digit (0–9), `\D` non-digit.  
- `\w` word character (letters, digits, underscore), `\W` non-word (whitespace and punctuation such as `-`, `!`, `;` etc.)   
- `\s` whitespace (space, tab, newline), `\S` non-whitespace.  
- `\b` word boundary (zero-width match). Useful for matching whole words and avoiding substring matches such as prefixes and suffixes.

In [8]:
import re

text = "The cat scattered 1024 catalogues in Room-08."

# Match digits
print("Digits:", re.findall(r"\d+", text))           # `\d` matches one digit; `\d+` matches a run of one or more digits

# Match words
print("Word characters:", re.findall(r"\w", text))   # one letter/digit/underscore at a time; skips spaces and punctuation
print("Word characters:", re.findall(r"\w+", text))  # runs of word characters, each ending at a space or punctuation mark

# Match spaces
print("Whitespace:", re.findall(r"\s+", text))

# Boundary
print("Word boundary:", re.findall(r"cat", text))      # matches every 'cat', including those inside longer words
print("Word boundary:", re.findall(r"\bcat\b", text))  # matches 'cat' only as a whole word

Digits: ['1024', '08']
Word characters: ['T', 'h', 'e', 'c', 'a', 't', 's', 'c', 'a', 't', 't', 'e', 'r', 'e', 'd', '1', '0', '2', '4', 'c', 'a', 't', 'a', 'l', 'o', 'g', 'u', 'e', 's', 'i', 'n', 'R', 'o', 'o', 'm', '0', '8']
Word characters: ['The', 'cat', 'scattered', '1024', 'catalogues', 'in', 'Room', '08']
Whitespace: [' ', ' ', ' ', ' ', ' ', ' ']
Word boundary: ['cat', 'cat', 'cat']
Word boundary: ['cat']
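
The cell above uses only the lowercase sequences. As a complementary sketch, the uppercase forms (`\D`, `\W`, `\S`) match exactly what their lowercase counterparts do not, and a single `\b` can anchor just the start of a word. The `text` and `sentence` strings are invented for illustration.

```python
import re

text = "Room-08: 2 cats!"

# \D+ : runs of non-digit characters
non_digits = re.findall(r"\D+", text)
# \W : single non-word characters (punctuation and whitespace)
non_word = re.findall(r"\W", text)
# \S+ : runs of non-whitespace characters
non_space = re.findall(r"\S+", text)

print(non_digits)
print(non_word)
print(non_space)

# A single \b at the front matches words that start with 'cat'
# but skips 'cat' inside 'scattered'.
sentence = "The cat scattered catalogues"
cat_prefixed = re.findall(r"\bcat\w*", sentence)
print(cat_prefixed)
```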


## Quantifiers

| Quantifier | Matches                                 | Greedy? | Non-greedy form  | Non-greedy behavior                      |
|------------|-----------------------------------------|---------|------------------|------------------------------------------|
| `?`        | 0 or 1 of the preceding token           | Yes     | `??`             | as few as possible (0 or 1)              |
| `*`        | 0 or more of the preceding token        | Yes     | `*?`             | as few as possible (including zero)      |
| `+`        | 1 or more of the preceding token        | Yes     | `+?`             | as few as possible (at least one)        |
| `{n}`      | exactly n of the preceding token        | -       | -                | -                                        |
| `{n,}`     | n or more of the preceding token        | Yes     | `{n,}?`          | n or more, but as few as possible        |
| `{n,m}`    | between n and m of the preceding token  | Yes     | `{n,m}?`         | between n and m, but as few as possible  |

In [10]:
import re

text = "aaaa"

# findall also reports empty matches, which is why '' appears in some results
print("? matches 0 or 1:", re.findall(r"a?", text))
print("* matches 0 or more:", re.findall(r"a*", text))
print("+ matches 1 or more:", re.findall(r"a+", text))
print("{n} matches exact number:", re.findall(r"a{2}", text))
print("{n,m} matches:", re.findall(r"a{1,3}", text))

print("Non-greedy a*:", re.findall(r"a*?", text))
print("Non-greedy a+:", re.findall(r"a+?", text))
print("Non-greedy a{1,3}?:", re.findall(r"a{1,3}?", text))


? matches 0 or 1: ['a', 'a', 'a', 'a', '']
* matches 0 or more: ['aaaa', '']
+ matches 1 or more: ['aaaa']
{n} matches exact number: ['aa', 'aa']
{n,m} matches: ['aaa', 'a']
Non-greedy a*: ['', 'a', '', 'a', '', 'a', '', 'a', '']
Non-greedy a+: ['a', 'a', 'a', 'a']
Non-greedy a{1,3}?: ['a', 'a', 'a', 'a']
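
Counted quantifiers are most useful for fixed-width fields. The sketch below applies `{n}`, `{n,m}`, and `{n,}` to a made-up log line; the field formats are assumptions chosen for the example.

```python
import re

log = "2024-03-07 job 12 took 5s; 2024-3-7 job 1234 took 120s"

# {n}: exactly 4-2-2 digits, so the unpadded date '2024-3-7' is rejected.
dates = re.findall(r"\b\d{4}-\d{2}-\d{2}\b", log)

# {n,m}: job ids with 2 to 4 digits.
jobs = re.findall(r"job \d{2,4}\b", log)

# {n,}: durations with at least 2 digits, so '5s' is skipped.
long_durations = re.findall(r"\b\d{2,}s\b", log)

print(dates)
print(jobs)
print(long_durations)
```

The `\b` anchors keep a counted quantifier from matching a slice of a longer digit run.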


## Quantifiers & Greedy vs Non-Greedy
- `?`, `*`, `+`, `{n,}`, and `{n,m}` are greedy by default: each matches as much text as it can while still letting the rest of the pattern succeed.  
- Appending `?` (`??`, `*?`, `+?`, `{n,}?`, `{n,m}?`) makes a quantifier non-greedy (lazy): it matches as little text as it can while still letting the rest of the pattern succeed.

In [21]:
import re

html = "<p>One</p><p>Two</p><></>"

print("Greedy:", re.findall(r"<.*>", html))       # `.*` is greedy: it runs to the end of the string, then backs up to the last `>`
print("Non-greedy:", re.findall(r"<.*?>", html))  # `.*?` is lazy: it stops at the first `>` after each `<`

Greedy: ['<p>One</p><p>Two</p><></>']
Non-greedy: ['<p>', '</p>', '<p>', '</p>', '<>', '</>']
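
The same distinction matters when capturing text between delimiters. The sketch below compares a lazy and a greedy capture group and shows a negated character class as an alternative to the lazy quantifier. It is meant only to illustrate quantifier behavior on a short string, not as a way to parse real HTML.

```python
import re

html = "<p>One</p><p>Two</p>"

# Lazy group: stops at the first closing tag after each opening tag.
lazy = re.findall(r"<p>(.*?)</p>", html)

# Greedy group: runs to the last closing tag in the string.
greedy = re.findall(r"<p>(.*)</p>", html)

# Negated class: '[^>]*' cannot cross a '>', so no lazy quantifier is needed.
tags = re.findall(r"<[^>]*>", html)

print(lazy)
print(greedy)
print(tags)
```

A negated character class states the stopping condition explicitly, which often makes the intent clearer than relying on `*?`.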
