# Regex Essentials: Overview
- Regular expressions (regex) are a language for defining text search patterns.  
- Python’s `re` module provides functions like `search` (find anywhere) and `match` (anchored at start).  
- Patterns include literals, metacharacters (`. ^ $ * + ? [] \`), character classes (`\d`, `\w`, `\s`), and quantifiers (`*`, `+`, `?`, `{n,m}`).  
- Greedy quantifiers (`*`, `+`) match as much as possible; non-greedy (`*?`, `+?`) as little as possible.

| Character | Matches | Example | Example matches |
|-----------|-------------|---------|-------------|
| ^ | Start of the string | ^hello | line starting with hello |
| $ | End of the string | hello$ | line ending with hello |
| . | Any single character  | hel.o | hello, heloo |  
| * | Preceding character 0 or more times | hel*o <br>- 'h', 'e' must be present <br>- followed by zero or more 'l' <br>- 'o' must be present | heo, helo, hello, helllo |
| ? | Preceding character 0 or one time | hel?o <br>- 'h', 'e' must be present <br>- followed by zero or one 'l' <br>- 'o' must be present | heo, helo |
| + | Preceding character 1 or more times | hel+o <br>- 'h', 'e' must be present <br>- followed by one or more 'l' <br>- 'o' must be present | hello, helllo, hellllo |
| [] | Any one of the characters enclosed in [] | he[lo] | hel, heo |
| [-] | Any character within the range | file[1-3] | file1, file2, file3 |
| [^] | Any one character except those enclosed in [] | file[^123] | file4, file5 |
| {n} | Preceding character must match exactly n times | he.{2}o <br>- . before {} means any char | hello, hekko, heppo |
| {n,} | Preceding character must match at least n times | he.{2,}o <br>- . before {} means any char | hello, hekkkko, hepppppo |
| {n,m} | Preceding character must match at least n times and at most m times | he.{2,3}o <br>- . before {} means any char | hello, hekkko, hepppo |
| \ | Escape character: makes the following metacharacter literal | hell\\.o <br>- . is a metacharacter; \\. matches a literal dot | hell.o |
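The rows above can be checked mechanically. The following is a minimal sketch, not part of the original notebook: it uses `re.fullmatch` so that a candidate counts only if the whole string fits the pattern, and the dictionary name `table_examples` is a hypothetical helper introduced for illustration.

```python
import re

# Pattern from the table -> strings expected to match in full.
# `table_examples` is a hypothetical name used only in this sketch.
table_examples = {
    r"hel.o": ["hello", "heloo"],
    r"hel*o": ["heo", "helo", "hello", "helllo"],
    r"hel?o": ["heo", "helo"],
    r"hel+o": ["helo", "hello", "helllo"],
    r"he[lo]": ["hel", "heo"],
    r"file[1-3]": ["file1", "file2", "file3"],
    r"file[^123]": ["file4", "file5"],
    r"he.{2,3}o": ["hello", "hekkko"],
    r"hell\.o": ["hell.o"],
}

for pattern, candidates in table_examples.items():
    for candidate in candidates:
        # fullmatch succeeds only if the entire candidate fits the pattern
        matched = re.fullmatch(pattern, candidate) is not None
        print(f"{pattern!r:16} {candidate!r:10} {matched}")

# Counter-examples that should NOT match
too_many_l = re.fullmatch(r"hel?o", "hello")     # '?' allows at most one 'l'
not_a_dot = re.fullmatch(r"hell\.o", "hellxo")   # escaped dot is a literal '.'
print(too_many_l, not_a_dot)  # None None
```

Note that `hel?o` rejects `hello`: the `?` applies only to the single `l` before it, so at most one `l` is allowed.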




## Introduction to `re.search()` vs `re.match()`
- `re.search(pattern, text)` scans the **entire string** for the **first occurrence**.  
- `re.match(pattern, text)` checks only at the **beginning of the string**.
- `re.findall()` and `re.finditer()` let you retrieve **every occurrence of a pattern**.   
- Always use **raw strings** (`r"..."`) to define regex patterns, so that Python's own string escapes do not interfere with the regex.

In [None]:
import re

line = "WARN: Disk usage at 91%"
pattern = r"Disk"

print(f"search '{pattern}':", bool(re.search(pattern, line))) # checks the entire string
print(f"match '{pattern}':", bool(re.match(pattern, line)))   # checks only at the beginning of the string

search 'Disk': True
match 'Disk': False
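The cell above covers `search` and `match`. As a complement, here is a small sketch of `re.findall`, `re.finditer`, and the raw-string advice from the list above. The sample string `log` and the variable names are made up for illustration.

```python
import re

# Hypothetical log line, used only for this sketch
log = "WARN: Disk usage at 91%, swap usage at 47%"
pattern = r"\d+%"

# findall returns every matched substring as a list of strings
percentages = re.findall(pattern, log)
print(percentages)  # ['91%', '47%']

# finditer yields match objects, which also expose the match position
for m in re.finditer(pattern, log):
    print(m.group(), m.span())

# Why raw strings matter: in a normal string "\b" is a backspace character,
# while in a raw string r"\b" reaches the regex engine as a word boundary.
without_raw = re.findall("\bcat\b", "the cat sat")
with_raw = re.findall(r"\bcat\b", "the cat sat")
print(without_raw)  # []
print(with_raw)     # ['cat']
```

`findall` is convenient when only the text is needed; `finditer` is the better fit when positions or groups matter.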


## Common Metacharacters
- `.` matches any character (except newline).  
- `^` anchors at start of string.  
- `$` anchors at end of string.  
- `[]` defines a set or range of characters, e.g. `[A-Z]`.  
- `\` escapes metacharacters or introduces special sequences.

In [2]:
import re

test = "Error code: E1234. cxge"

# Start and end
print(f"Starts with 'Error': {re.findall(r"^Error", test)}")
print(f"Starts with 'E1234' : {re.findall(r"^E1234", test)}")
print(f"End with cxge: {re.findall(r"cxge$", test)}")

# Any character
print(f"Dot matches any character: {re.findall(r"c..e", test)}")
print(f"Dot matches any character: {re.findall(r"c.d", test)}")

# Quantifiers: *, ? and +
print(f"Zero or more times: {re.findall(r"Er*or", test)}")
print(f"Zero or one time: {re.findall(r"Er?or", test)}")
print(f"One or more time: {re.findall(r"Er+or", test)}")

# Character set
print(f"Character set: {re.findall(r"[E0-9]+", test)}")

# Matches n times
print(f"Matches 'r' exactly 2 times: {re.findall(r"r{2}", test)}")
print(f"Matches 'r' at least 1 and at most 2 times: {re.findall(r"r{1,2}", test)}")


Starts with 'Error': ['Error']
Starts with 'E1234' : []
End with cxge: ['cxge']
Dot matches any character: ['code', 'cxge']
Dot matches any character: ['cod']
Zero or more times: ['Error']
Zero or one time: []
One or more time: ['Error']
Character set: ['E', 'E1234']
Matches 'r' exactly 2 times: ['rr']
Matches 'r' at least 1 and at most 2 times: ['rr', 'r']
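Two points from the list above are not exercised by the cell: `.` does not match a newline, and `\` turns a metacharacter into a literal. A minimal sketch, with a made-up two-line string `versions`:

```python
import re

# Hypothetical two-line input, used only for this sketch
versions = "v1.2\nv1x2"

any_char = re.findall(r"v1.2", versions)      # unescaped dot matches any character
literal_dot = re.findall(r"v1\.2", versions)  # escaped dot matches only '.'
across_lines = re.findall(r"2.v", versions)   # dot does not match the newline
at_start = re.findall(r"^v1", versions)       # '^' anchors at the start of the string

print(any_char)      # ['v1.2', 'v1x2']
print(literal_dot)   # ['v1.2']
print(across_lines)  # []
print(at_start)      # ['v1']
```

Without extra flags, `^` matches only at the very start of the string, which is why the second line's `v1` is not reported.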


## Special Sequences
- `\d` digit (0–9), `\D` non-digit.  
- `\w` word character (letters, digits, underscore), `\W` non-word (whitespace and punctuation such as `-`, `!`, `;` etc.)   
- `\s` whitespace, `\S` non-whitespace.  
- `\b` word boundary (zero-width match). Useful for matching whole words and avoiding substring matches such as prefixes and suffixes.

In [None]:
import re

text = "The cat scattered 1024 catalogues in Room-08."

# Match digits
print(f"Digits: {re.findall(r"\d+", text)}")                # '+' means one or more 

# Match words
print(f"Word characters: {re.findall(r"\w", text)}")  # matches each letter or digit one at a time, not whitespace and punctuation
print(f"Word characters: {re.findall(r"\w+", text)}") # matches sequence of chars since '+' matches one or more chars

# Match spaces
print(f"Whitespace: {re.findall(r"\s+", text)}")

# Match whole words only
print(f"Word boundary: {re.findall(r"\bcat\b", text)}") # 'cat' inside 'scattered' and 'catalogues' is skipped

Digits: ['1024', '08']
Word characters: ['T', 'h', 'e', 'c', 'a', 't', 's', 'c', 'a', 't', 't', 'e', 'r', 'e', 'd', '1', '0', '2', '4', 'c', 'a', 't', 'a', 'l', 'o', 'g', 'u', 'e', 's', 'i', 'n', 'R', 'o', 'o', 'm', '0', '8']
Word characters: ['The', 'cat', 'scattered', '1024', 'catalogues', 'in', 'Room', '08']
Whitespace: [' ', ' ', ' ', ' ', ' ', ' ']
Word boundary: ['cat']
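The uppercase complements (`\D`, `\W`, `\S`) and the effect of dropping one or both `\b` anchors are easiest to see side by side. A short sketch on the same sentence; the variable names are illustrative only.

```python
import re

text = "The cat scattered 1024 catalogues in Room-08."

non_digits = re.findall(r"\D+", text)   # runs of anything that is not a digit
non_word = re.findall(r"\W+", text)     # whitespace and punctuation
non_space = re.findall(r"\S+", text)    # whitespace-separated tokens

print(non_digits)  # ['The cat scattered ', ' catalogues in Room-', '.']
print(non_word)    # [' ', ' ', ' ', ' ', ' ', ' ', '-', '.']
print(non_space)   # ['The', 'cat', 'scattered', '1024', 'catalogues', 'in', 'Room-08.']

# Word boundaries: compare no anchor, a leading anchor, and both anchors
anywhere = re.findall(r"cat", text)         # also inside 'scattered' and 'catalogues'
prefix = re.findall(r"\bcat\w*", text)      # words that start with 'cat'
whole_word = re.findall(r"\bcat\b", text)   # the standalone word only

print(anywhere)    # ['cat', 'cat', 'cat']
print(prefix)      # ['cat', 'catalogues']
print(whole_word)  # ['cat']
```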


## Quantifiers

| Quantifier | Matches                                 | Greedy? | Non-greedy form  | Non-greedy behavior                      |
|------------|-----------------------------------------|---------|------------------|------------------------------------------|
| `?`        | 0 or 1 of the preceding token           | Yes     | `??`             | as few as possible (0 or 1)              |
| `*`        | 0 or more of the preceding token        | Yes     | `*?`             | as few as possible (including zero)      |
| `+`        | 1 or more of the preceding token        | Yes     | `+?`             | as few as possible (at least one)        |
| `{n}`      | exactly n of the preceding token        | -       | -                | -                                        |
| `{n,}`     | n or more of the preceding token        | Yes     | `{n,}?`          | n or more, but as few as possible        |
| `{n,m}`    | between n and m of the preceding token  | Yes     | `{n,m}?`         | between n and m, but as few as possible  |

In [61]:
import re

text = "aaaa"

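print(re.findall(r"a?", text))      # 0 or 1 'a' per match, then an empty match at the end
print(re.findall(r"a*", text))      # one greedy run, then an empty match at the end
print(re.findall(r"a+", text))      # one greedy run of 1 or more
print(re.findall(r"a{2}", text))    # exactly two per match
print(re.findall(r"a{1,3}", text))  # up to three first, then the remaining one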

print(f"Non-greedy a*: {re.findall(r"a*?", text)}")
print(f"Non-greedy a+: {re.findall(r"a+?", text)}")
print(f"Non-greedy a{{1,3}}?: {re.findall(r"a{1,3}?", text)}")

['a', 'a', 'a', 'a', '']
['aaaa', '']
['aaaa']
['aa', 'aa']
['aaa', 'a']
Non-greedy a*: ['', 'a', '', 'a', '', 'a', '', 'a', '']
Non-greedy a+: ['a', 'a', 'a', 'a']
Non-greedy a{1,3}?: ['a', 'a', 'a', 'a']
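The table also lists `??` and `{n,}`, which the cell above does not run. The sketch below fills that gap on made-up inputs; it illustrates the table rows and is not the notebook's own code.

```python
import re

digits = "123456"

greedy = re.findall(r"\d{2,4}", digits)   # takes 4 digits first, then the remaining 2
lazy = re.findall(r"\d{2,4}?", digits)    # takes the minimum of 2 digits each time
print(greedy)  # ['1234', '56']
print(lazy)    # ['12', '34', '56']

# '?' prefers one repetition; '??' prefers zero, so the match can be empty
optional_greedy = re.match(r"a?", "aaaa").group()
optional_lazy = re.match(r"a??", "aaaa").group()
print(repr(optional_greedy))  # 'a'
print(repr(optional_lazy))    # ''

# {n,} sets only a lower bound on the repetition count
long_numbers = re.findall(r"\d{3,}", "7 42 123 98765")
print(long_numbers)  # ['123', '98765']
```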


## Quantifiers & Greedy vs Non-Greedy
- `*`, `+`, `?`, `{n,}` and `{n,m}` are greedy by default: each takes as many repetitions as it can while still letting the rest of the pattern match.  
- Appending `?` (`*?`, `+?`, `??`, `{n,}?`, `{n,m}?`) makes a quantifier non-greedy (lazy): it takes as few repetitions as it can.  
- The difference shows when a delimiter repeats in the text: greedy `<.*>` runs to the last `>`, while lazy `<.*?>` stops at the first one.

In [68]:
import re

html = "<p>One</p><p>Two</p><></>"

print(f"Greedy: {re.findall(r"<.*>", html)}")
print(f"Non-greedy: {re.findall(r"<.*?>", html)}")

Greedy: ['<p>One</p><p>Two</p><></>']
Non-greedy: ['<p>', '</p>', '<p>', '</p>', '<>', '</>']
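A common alternative to a lazy quantifier is a negated character class, which cannot run past the delimiter in the first place. The sketch below compares the two on the same input and also uses a capturing group, which this notebook has not introduced, to pull out the text between tags. Treat it as an illustration, not a recommendation for parsing real HTML.

```python
import re

html = "<p>One</p><p>Two</p><></>"

lazy_tags = re.findall(r"<.*?>", html)       # lazy: stop at the first '>'
negated_tags = re.findall(r"<[^>]*>", html)  # negated class: never crosses a '>'
print(lazy_tags == negated_tags)  # True for this input

# With one capturing group, findall returns only the captured part
lazy_content = re.findall(r"<p>(.*?)</p>", html)
greedy_content = re.findall(r"<p>(.*)</p>", html)
print(lazy_content)    # ['One', 'Two']
print(greedy_content)  # ['One</p><p>Two']
```

The greedy version swallows the inner `</p><p>` because `.*` first runs to the end of the string and only backtracks as far as the last `</p>`.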
