# Capturing Groups and Back-References
- Regex lets you check for patterns, but often you need to **extract** pieces of the match (e.g., IP vs port).
- Capturing groups, defined with `()`, let you isolate and retrieve substrings from a match.
- Named groups improve readability by giving meaningful labels instead of relying on group numbers.
- Non-capturing groups `(?:…)` let you apply grouping logic without cluttering captures.
- Back-references allow you to match the same text twice (or more) within one pattern.

## Capturing Groups
- Parentheses `()` both group and **capture** the matched text inside them.
- Groups are numbered by their opening `(`, starting at 1; group 0 is the entire match.
- Use `match.group(n)` for a single group or `match.groups()` to get all captures as a tuple.
- Capturing is essential when you need to feed specific substrings into further processing.

In [33]:
import re

log_entry = "Ts=2023-10-27T12:00:00Z Level=ERROR User=admin Action=login_fail IP=10.0.0.5"

# Our goal:
# 1. Group 1: The log level
# 2. Group 2: The user name
# 3. Group 3: The IP address

pattern = r"Level=(\w+)\s+User=(\w+).*?\s+IP=([\d\.]+)"

match = re.search(pattern, log_entry)

if match:
    print(f"Full match: {match.group(0)}")
    print(f"Level: {match.group(1)}")
    print(f"User: {match.group(2)}")
    print(f"IP: {match.group(3)}")
    print(f"All groups: {match.groups()}")

Full match: Level=ERROR User=admin Action=login_fail IP=10.0.0.5
Level: ERROR
User: admin
IP: 10.0.0.5
All groups: ('ERROR', 'admin', '10.0.0.5')
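
The rule that groups are numbered by their opening `(` matters most when groups are nested. The sketch below uses a hypothetical `IP=address:port` string (not part of the log format above) to show that the outer group receives the lower number because its parenthesis opens first.

```python
import re

# Hypothetical sample string, chosen only to illustrate numbering.
endpoint = "IP=10.0.0.5:8080"

# Opening parentheses, read left to right:
#   group 1 -> the whole "address:port" pair (outer parentheses)
#   group 2 -> the address (first inner group)
#   group 3 -> the port (second inner group)
pattern = r"IP=(([\d.]+):(\d+))"

match = re.search(pattern, endpoint)

if match:
    print(f"Endpoint: {match.group(1)}")
    print(f"Address: {match.group(2)}")
    print(f"Port: {match.group(3)}")
    print(f"All groups: {match.groups()}")
```

Counting opening parentheses from the left is usually the quickest way to find the number of a group in a longer pattern.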


## Named Capturing Groups
- Syntax: `(?P<name>pattern)` assigns a label to a capturing group.
- Access by name: `match.group('name')` makes code self-documenting.
- `match.groupdict()` returns a dict of all named captures.
- You can still use numeric indices if needed, but names help avoid off-by-one errors.

In [35]:
import re

log_entry = "Ts=2023-10-27T12:00:00Z Level=ERROR User=admin Action=login_fail IP=10.0.0.5"

# Add labels to:
# 1. Group 1: The log level
# 2. Group 2: The user name
# 3. Group 3: The IP address

pattern = r"Level=(?P<level>\w+)\s+User=(?P<user>\w+).*?\s+IP=(?P<ip>[\d\.]+)"

match = re.search(pattern, log_entry)

if match:
    print(f"Full match: {match.group(0)}")
    # Single quotes inside the f-string: reusing double quotes is a SyntaxError before Python 3.12
    print(f"Level: {match.group('level')}")
    print(f"User: {match.group('user')}")
    print(f"IP: {match.group('ip')}")
    print(f"All groups: {match.groups()}")
    print(f"Group dictionary: {match.groupdict()}")

Full match: Level=ERROR User=admin Action=login_fail IP=10.0.0.5
Level: ERROR
User: admin
IP: 10.0.0.5
All groups: ('ERROR', 'admin', '10.0.0.5')
Group dictionary: {'level': 'ERROR', 'user': 'admin', 'ip': '10.0.0.5'}
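
A named group keeps its numeric index, and `groupdict()` gives one ready-made record per match. The following is a minimal sketch of both points; the second log line and the names `log_lines`, `records`, and `same_text` are made up for illustration.

```python
import re

# Hypothetical log lines in the same format as the example above.
log_lines = [
    "Ts=2023-10-27T12:00:00Z Level=ERROR User=admin Action=login_fail IP=10.0.0.5",
    "Ts=2023-10-27T12:00:05Z Level=INFO User=guest Action=login_ok IP=10.0.0.7",
]

pattern = re.compile(
    r"Level=(?P<level>\w+)\s+User=(?P<user>\w+).*?\s+IP=(?P<ip>[\d.]+)"
)

records = []
same_text = []
for line in log_lines:
    match = pattern.search(line)
    if match:
        # A named group is still numbered: group(1) and group('level') agree.
        same_text.append(match.group(1) == match.group("level"))
        # One dict per log line, keyed by group name.
        records.append(match.groupdict())

print(same_text)
for record in records:
    print(record)
```

Because each record is a plain dict, it can be passed on to later processing without tracking group positions.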


## Non-Capturing Groups
- Use `(?:pattern)` when you need grouping for quantifiers or alternation **without** capturing.
- Keeps your capture numbers focused on what you actually want.
- Avoids extra entries in `match.groups()`: an optional capturing group that does not take part in the match shows up as `None`.

In [51]:
import re

log_line1 = "report.txt Status: OK"
log_line2 = "report Status: OK"

# Our goal:
# 1. Group 1: The stem of the filename (the .txt extension is optional)
# 2. Group 2: The status text

pattern = r"^(.+?)(?:\.txt)?\s+Status:\s+(.+)$"

match_line1 = re.search(pattern, log_line1)
match_line2 = re.search(pattern, log_line2)

if match_line1: print(match_line1.groups())
if match_line2: print(match_line2.groups())

('report', 'OK')
('report', 'OK')
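
To see what the non-capturing form saves, compare it with the same pattern written with a plain capturing group around the extension. This is a small sketch; the variable names `capturing`, `non_capturing`, `with_capture`, and `without_capture` are illustrative.

```python
import re

log_line = "report Status: OK"

# Same pattern twice; only the group around the extension differs.
capturing = r"^(.+?)(\.txt)?\s+Status:\s+(.+)$"
non_capturing = r"^(.+?)(?:\.txt)?\s+Status:\s+(.+)$"

with_capture = re.search(capturing, log_line).groups()
without_capture = re.search(non_capturing, log_line).groups()

# The optional capturing group did not take part in the match,
# so it contributes a None and shifts the status to group 3.
print(with_capture)
print(without_capture)
```

With the capturing version, the status moves from group 2 to group 3, so any code that indexes groups by number has to change as well.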


## Back-references
- Refer back to a previous capture using `\1`, `\2`, … or `(?P=name)` for named groups.
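- Useful for matching repeated words or paired constructs (e.g., an opening tag and its matching closing tag).
- A back-reference matches the text the group actually captured rather than re-applying the group's pattern, which makes it well suited to validation at the cost of a harder-to-read pattern.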

In [61]:
import re

text = "This this is a test test."
pattern_numbers = r"(?i)\b(\w+)\s+\1\b"
pattern_labels = r"(?i)\b(?P<word>\w+)\s+(?P=word)\b"

print(f"Doubled words: {re.findall(pattern_numbers, text)}")
print(f"Doubled words: {re.findall(pattern_labels, text)}")

html = "<p>Paragraph</p> <b>Bold</b>"
pattern_tags = r"<(\w+)>(.*?)</\1>"

print(f"Tags: {re.findall(pattern_tags, html)}")

Doubled words: ['This', 'test']
Doubled words: ['This', 'test']
Tags: [('p', 'Paragraph'), ('b', 'Bold')]
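
Another common use of the same idea is matching a quoted string whose closing quote must be the same character as its opening quote. The sketch below assumes a made-up `key=value` sample; the last value is deliberately left without a matching closing quote.

```python
import re

# Hypothetical input: two well-formed values and one whose quotes do not pair up.
text = """name="O'Brien" city='Paris' note="unterminated' end"""

# Group 1 captures the opening quote; \1 requires the same character to close it.
pattern_quotes = r"""(['"])(.*?)\1"""

quoted = [(m.group(1), m.group(2)) for m in re.finditer(pattern_quotes, text)]

for quote_char, value in quoted:
    print(f"Quote: {quote_char}  Value: {value}")
```

The apostrophe inside `O'Brien` does not end the first value, because `\1` only accepts the double quote that opened it, and the unpaired quotes in the last value produce no match at all.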
