# Regex Essentials: Overview
- Regular expressions (regex) are a language for defining text search patterns.  
- Python’s `re` module provides functions like `search` (find anywhere) and `match` (anchored at start).  
- Patterns include literals, metacharacters (`. ^ $ * + ? [] \`), character classes (`\d`, `\w`, `\s`), and quantifiers (`*`, `+`, `?`, `{n,m}`).  
- Greedy quantifiers (`*`, `+`) match as much as possible; non-greedy (`*?`, `+?`) as little as possible.

## Introduction to `re.search()` vs `re.match()`
- `re.search(pattern, text)` scans the entire string for the first occurrence.  
- `re.match(pattern, text)` checks only at the beginning of the string.
- `re.findall()` and `re.finditer()` let you retrieve every occurrence of a pattern.   
- Always write regex patterns as raw strings (`r"..."`) so that Python does not process backslash escapes before the regex engine sees them.
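
The raw-string advice is easiest to see with `\b`. The sketch below uses a made-up sample string and compares the same pattern written with and without the `r` prefix; the variable names are illustrative only.

```python
import re

text = "cat concat cat."  # made-up sample text

# In a normal string literal, "\b" is the backspace character (ASCII 8),
# so the regex engine never receives a word-boundary token.
plain = "\bcat\b"
# In a raw string the backslash survives, so `re` sees \b (word boundary).
raw = r"\bcat\b"

plain_hits = re.findall(plain, text)
raw_hits = re.findall(raw, text)

print("lengths:", len(plain), len(raw))
print("plain:", plain_hits)
print("raw:", raw_hits)
```

The non-raw pattern is two characters shorter because each `\b` collapsed into a single backspace character, which never occurs in the sample text.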

In [25]:
import re

line = "WARN: Disk usage at 91%"
pattern = r"WARN"

print(f"search '{pattern}':", bool(re.search(pattern, line)))
print(f"match '{pattern}':", bool(re.match(pattern, line)))

search 'WARN': True
match 'WARN': True
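
In the cell above both calls succeed because `WARN` happens to sit at position 0. A minimal sketch of the case where they differ, using a hypothetical log line with a timestamp prefix, and of `re.finditer()` reporting every occurrence:

```python
import re

# Hypothetical log line: the pattern no longer starts at position 0.
line = "2024-05-01 WARN: Disk usage at 91%, WARN: swap at 80%"
pattern = r"WARN"

# search() scans forward through the string; match() only tries position 0.
found_anywhere = re.search(pattern, line)
found_at_start = re.match(pattern, line)
print("search:", bool(found_anywhere))
print("match:", bool(found_at_start))

# finditer() yields one match object per occurrence, including its position.
spans = [m.span() for m in re.finditer(pattern, line)]
print("spans:", spans)
```

Here `search` reports a hit while `match` returns `None`, and `finditer` exposes where each of the two occurrences sits in the line.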


## Common Metacharacters
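- `.` matches any character except a newline (unless `re.DOTALL` is set).  
- `^` anchors at the start of the string (and at the start of each line with `re.MULTILINE`).  
- `$` anchors at the end of the string, or just before a trailing newline.  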
- `[]` defines a set or range of characters, e.g. `[A-Z]`.  
- `\` escapes metacharacters or introduces special sequences.

In [36]:
import re

test = "Error code: E1234. cxge"

print("Dot matches any character:", re.findall(r"c..e", test))
print("Start anchor (finds):", re.findall(r"^Error", test))
print("Start anchor (does not find):", re.findall(r"^E1234", test))
print("End anchor:", re.findall(r"cxge$", test))
# [E0-9]+ matches runs made up of "E" and digits only
print("Character set:", re.findall(r"[E0-9]+", test))

Dot matches any character: ['code', 'cxge']
Start anchor (finds): ['Error']
Start anchor (does not find): []
End anchor: ['cxge']
Character set: ['E', 'E1234']
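
Two points from the list are not exercised above: escaping a metacharacter with `\`, and how a flag changes what `^` means. The following is a small sketch on a made-up multi-line string; the variable names are illustrative only.

```python
import re

log = "v1.2 ok\nv1x2 bad\nError: done."  # made-up three-line sample

# An unescaped dot matches any character, so "v1x2" matches too.
loose = re.findall(r"v1.2", log)
# Escaping the dot restricts the match to a literal ".".
strict = re.findall(r"v1\.2", log)

# By default ^ matches only at the very start of the string;
# re.MULTILINE lets it match right after each newline as well.
first_only = re.findall(r"^\w+", log)
each_line = re.findall(r"^\w+", log, flags=re.MULTILINE)

print("loose:", loose)
print("strict:", strict)
print("first_only:", first_only)
print("each_line:", each_line)
```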


## Special Sequences
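- `\d` matches a decimal digit, `\D` a non-digit. For `str` patterns this includes Unicode digits, not only 0–9, unless `re.ASCII` is set.  
- `\w` matches a word character (letters, digits, underscore; Unicode-aware for `str` patterns), `\W` a non-word character.  
- `\s` matches whitespace (space, tab, newline, and similar), `\S` non-whitespace.  
- `\b` matches a word boundary: a zero-width position between a word character and a non-word character, or at the start or end of the string next to a word character.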

In [42]:
import re

text = "The cat scattered 1024 catalogues."

print("Digits:", re.findall(r"\d+", text))
print("Word characters:", re.findall(r"\w+", text))
print("Whitespace:", re.findall(r"\s+", text))
# \b on both sides rejects "cat" inside "scattered" and "catalogues"
print("Word boundary:", re.findall(r"\bcat\b", text))

Digits: ['1024']
Word characters: ['The', 'cat', 'scattered', '1024', 'catalogues']
Whitespace: [' ', ' ', ' ', ' ']
Word boundary: ['cat']
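
For `str` patterns, `\d` and `\w` are Unicode-aware by default. The sketch below, on a made-up sample containing one Arabic-Indic digit and one accented letter (written as escapes so the source stays ASCII), shows how `re.ASCII` narrows both classes.

```python
import re

# "\u0663" is ARABIC-INDIC DIGIT THREE; "\u00e9" is "e" with an acute accent.
sample = "room 42, \u0663 caf\u00e9s"

unicode_digits = re.findall(r"\d", sample)
ascii_digits = re.findall(r"\d", sample, flags=re.ASCII)

unicode_words = re.findall(r"\w+", sample)
ascii_words = re.findall(r"\w+", sample, flags=re.ASCII)

print("unicode_digits:", unicode_digits)
print("ascii_digits:", ascii_digits)
print("unicode_words:", unicode_words)
print("ascii_words:", ascii_words)
```

With `re.ASCII` the accented letter no longer counts as a word character, so the last word is split around it. Whether that is desirable depends on the input you expect.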


## Quantifier Cheat-Sheet

| Quantifier | Meaning                                 | Greedy? | Non-greedy form  | Non-greedy meaning                       |
|------------|-----------------------------------------|---------|------------------|------------------------------------------|
| `?`        | 0 or 1 of the preceding token           | Yes     | `??`             | as few as possible (0 or 1)              |
| `*`        | 0 or more of the preceding token        | Yes     | `*?`             | as few as possible (including zero)      |
| `+`        | 1 or more of the preceding token        | Yes     | `+?`             | as few as possible (at least one)        |
| `{n}`      | exactly n of the preceding token        | -       | -                | -                                        |
| `{n,}`     | n or more of the preceding token        | Yes     | `{n,}?`          | n or more, but as few as possible        |
| `{n,m}`    | between n and m of the preceding token  | Yes     | `{n,m}?`         | between n and m, but as few as possible  |

In [61]:
import re

text = "aaaa"

# An empty string '' in the results is a zero-length match (zero repetitions).
print(re.findall(r"a?", text))       # 0 or 1
print(re.findall(r"a*", text))       # 0 or more
print(re.findall(r"a+", text))       # 1 or more
print(re.findall(r"a{2}", text))     # exactly 2
print(re.findall(r"a{1,3}", text))   # between 1 and 3

print("Non-greedy a*:", re.findall(r"a*?", text))
print("Non-greedy a+:", re.findall(r"a+?", text))
print("Non-greedy a{1,3}?:", re.findall(r"a{1,3}?", text))

['a', 'a', 'a', 'a', '']
['aaaa', '']
['aaaa']
['aa', 'aa']
['aaa', 'a']
Non-greedy a*: ['', 'a', '', 'a', '', 'a', '', 'a', '']
Non-greedy a+: ['a', 'a', 'a', 'a']
Non-greedy a{1,3}?: ['a', 'a', 'a', 'a']
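
On less uniform input, a bounded quantifier usually needs anchoring to mean what the table says. This sketch uses a made-up list of numbers and shows `{2,4}` with and without word boundaries.

```python
import re

text = "ids: 7, 42, 123, 4567, 89012"  # made-up sample

# With \b on both sides, only complete runs of 2 to 4 digits qualify.
two_to_four = re.findall(r"\b\d{2,4}\b", text)
# Without boundaries, {2,4} takes up to 4 digits out of a longer run.
chunks = re.findall(r"\d{2,4}", text)

print("two_to_four:", two_to_four)
print("chunks:", chunks)
```

The unanchored pattern takes the first four digits of the five-digit number and drops the leftover single digit, which is rarely the intended result.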


## Quantifiers & Greedy vs Non-Greedy
- `?`, `*`, `+`, `{n,}` and `{n,m}` are greedy by default: each takes as many repetitions as it can while still letting the rest of the pattern match.  
- Appending `?` (`??`, `*?`, `+?`, `{n,}?`, `{n,m}?`) makes a quantifier non-greedy (lazy): it takes as few repetitions as it can.

In [68]:
import re

html = "<p>One</p><p>Two</p><></>"

print("Greedy:", re.findall(r"<.*>", html))
print("Non-greedy:", re.findall(r"<.*?>", html))

Greedy: ['<p>One</p><p>Two</p><></>']
Non-greedy: ['<p>', '</p>', '<p>', '</p>', '<>', '</>']
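
A lazy quantifier is not the only way to stop at the first `>`. As a sketch on a shortened version of the sample string, a negated character class gives the same tags here, and a capture group pulls out only the text between them. None of this is a robust way to parse real HTML.

```python
import re

html = "<p>One</p><p>Two</p>"

# Lazy quantifier: stop at the first ">" that lets the match succeed.
lazy_tags = re.findall(r"<.*?>", html)
# Negated class: "[^>]*" cannot cross a ">", so no laziness is needed.
class_tags = re.findall(r"<[^>]*>", html)
# With one capture group, findall() returns only the captured text.
contents = re.findall(r"<p>(.*?)</p>", html)

print("lazy_tags:", lazy_tags)
print("class_tags:", class_tags)
print("contents:", contents)
```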
