# Topic 23: Regular Expressions - Pattern Matching

## Overview
Regular expressions (regex) describe text patterns compactly, which makes them useful for searching, validating, extracting, and transforming text. This topic covers the core pattern syntax and Python's `re` module.

### What You'll Learn:
- Basic regex patterns and metacharacters
- Python's re module functions
- Groups and capturing
- Advanced patterns and flags
- Real-world text processing applications
- Performance considerations

---

## 1. Basic Regular Expression Patterns

Understanding fundamental regex syntax:

In [5]:
# Basic regular expression patterns
import re

print("Basic Regular Expression Patterns:")
print("=" * 35)

# Basic pattern matching
print("1. Basic pattern matching:")

# Literal matching
text = "Hello, World! Welcome to Python programming."
pattern = "Python"

match = re.search(pattern, text)
if match:
    print(f"   Found '{pattern}' at position {match.start()}-{match.end()}")
    print(f"   Matched text: '{match.group()}'")
else:
    print(f"   Pattern '{pattern}' not found")

# Case-sensitive vs case-insensitive
print(f"\n2. Case sensitivity:")
text_mixed = "python PYTHON Python PyThOn"
pattern_case = "python"

print(f"   Text: {text_mixed}")
print(f"   Pattern: {pattern_case}")

# Case-sensitive search (default)
matches_case = re.findall(pattern_case, text_mixed)
print(f"   Case-sensitive matches: {matches_case}")

# Case-insensitive search
matches_ignore = re.findall(pattern_case, text_mixed, re.IGNORECASE)
print(f"   Case-insensitive matches: {matches_ignore}")

# Metacharacters
print(f"\n3. Basic metacharacters:")

text_meta = "The price is $123.45 and tax is 8.5%"

# . (dot) - matches any character except newline
print(f"   Text: {text_meta}")
print(f"   Pattern '.rice': {re.findall(r'.rice', text_meta)}")
print(f"   Pattern 'p..ce': {re.findall(r'p..ce', text_meta)}")

# ^ - matches start of string
start_matches = [
    ("The", re.search(r'^The', text_meta)),
    ("price", re.search(r'^price', text_meta))
]

for pattern, match in start_matches:
    result = "Found" if match else "Not found"
    print(f"   Pattern '^{pattern}': {result}")

# $ - matches end of string
end_text = "Hello World!"
end_matches = [
    ("World!", re.search(r'World!$', end_text)),
    ("Hello", re.search(r'Hello$', end_text))
]

for pattern, match in end_matches:
    result = "Found" if match else "Not found"
    print(f"   Pattern '{pattern}$': {result}")

# Character classes
print(f"\n4. Character classes:")

text_chars = "Hello123 World456 Test789!"
print(f"   Text: {text_chars}")

# \d - digits
digits = re.findall(r'\d', text_chars)
print(f"   \\d (digits): {digits}")

# \d+ - one or more digits
digit_groups = re.findall(r'\d+', text_chars)
print(f"   \\d+ (digit groups): {digit_groups}")

# \w - word characters (letters, digits, underscore)
word_chars = re.findall(r'\w', text_chars)
print(f"   \\w (word chars): {word_chars[:10]}...")  # Show first 10

# \w+ - word groups
words = re.findall(r'\w+', text_chars)
print(f"   \\w+ (words): {words}")

# \s - whitespace
whitespace = re.findall(r'\s', text_chars)
print(f"   \\s (whitespace): {repr(whitespace)}")

# Custom character classes
print(f"\n5. Custom character classes:")

vowel_text = "Education is important"
print(f"   Text: {vowel_text}")

# [aeiou] - specific characters
vowels = re.findall(r'[aeiou]', vowel_text.lower())
print(f"   [aeiou] (vowels): {vowels}")

# [a-z] - character range
lowercase = re.findall(r'[a-z]', vowel_text)
print(f"   [a-z] (lowercase): {lowercase[:10]}...")  # Show first 10

# [A-Z] - uppercase range
uppercase = re.findall(r'[A-Z]', vowel_text)
print(f"   [A-Z] (uppercase): {uppercase}")

# [0-9] - digit range (same as \d)
digit_range = re.findall(r'[0-9]', "Age: 25, Score: 98")
print(f"   [0-9] (digit range): {digit_range}")

# [^...] - negation (NOT)
non_vowels = re.findall(r'[^aeiou\s]', vowel_text.lower())
print(f"   [^aeiou\\s] (consonants): {non_vowels}")

# Quantifiers
print(f"\n6. Quantifiers:")

quant_text = "a aa aaa aaaa b bb bbb"
print(f"   Text: {quant_text}")

# * - zero or more
star_matches = re.findall(r'a*', quant_text)
print(f"   a* (zero or more a): {star_matches[:10]}...")  # Many empty matches

# + - one or more
plus_matches = re.findall(r'a+', quant_text)
print(f"   a+ (one or more a): {plus_matches}")

# ? - zero or one
question_matches = re.findall(r'a?', quant_text)
print(f"   a? (zero or one a): {question_matches[:10]}...")  # Many matches

# {n} - exactly n
exact_matches = re.findall(r'a{3}', quant_text)
print(f"   a{{3}} (exactly 3 a's): {exact_matches}")

# {n,m} - between n and m
range_matches = re.findall(r'a{2,3}', quant_text)
print(f"   a{{2,3}} (2-3 a's): {range_matches}")

# {n,} - n or more
min_matches = re.findall(r'a{2,}', quant_text)
print(f"   a{{2,}} (2 or more a's): {min_matches}")

# Practical examples
print(f"\n7. Practical pattern examples:")

# Email pattern (basic)
email_text = "Contact: john@example.com or support@company.org"
email_pattern = r'\w+@\w+\.\w+'
emails = re.findall(email_pattern, email_text)
print(f"   Text: {email_text}")
print(f"   Email pattern: {emails}")

# Phone number pattern
phone_text = "Call 123-456-7890 or (555) 123-4567"
phone_pattern = r'\(?\d{3}\)?[-\s]?\d{3}[-\s]?\d{4}'
phones = re.findall(phone_pattern, phone_text)
print(f"   Text: {phone_text}")
print(f"   Phone numbers: {phones}")

# URL pattern (basic)
url_text = "Visit https://www.example.com or http://test.org/path"
url_pattern = r'https?://[\w.-]+(?:/[\w.-]*)*'
urls = re.findall(url_pattern, url_text)
print(f"   Text: {url_text}")
print(f"   URLs: {urls}")

# Time pattern
time_text = "Meeting at 09:30, lunch at 12:45, finish by 17:00"
time_pattern = r'\d{1,2}:\d{2}'
times = re.findall(time_pattern, time_text)
print(f"   Text: {time_text}")
print(f"   Times: {times}")

# Greedy vs non-greedy matching
print(f"\n8. Greedy vs non-greedy:")

html_text = "<div>Hello</div> and <span>World</span>"
print(f"   HTML text: {html_text}")

# Greedy (default) - matches as much as possible
greedy_pattern = r'<.*>'
greedy_match = re.findall(greedy_pattern, html_text)
print(f"   Greedy <.*>: {greedy_match}")

# Non-greedy - matches as little as possible
non_greedy_pattern = r'<.*?>'
non_greedy_match = re.findall(non_greedy_pattern, html_text)
print(f"   Non-greedy <.*?>: {non_greedy_match}")

# Extract content between tags
content_pattern = r'<[^>]+>([^<]*)</[^>]+>'
content_matches = re.findall(content_pattern, html_text)
print(f"   Content between tags: {content_matches}")

Basic Regular Expression Patterns:
1. Basic pattern matching:
   Found 'Python' at position 25-31
   Matched text: 'Python'

2. Case sensitivity:
   Text: python PYTHON Python PyThOn
   Pattern: python
   Case-sensitive matches: ['python']
   Case-insensitive matches: ['python', 'PYTHON', 'Python', 'PyThOn']

3. Basic metacharacters:
   Text: The price is $123.45 and tax is 8.5%
   Pattern '.rice': ['price']
   Pattern 'p..ce': ['price']
   Pattern '^The': Found
   Pattern '^price': Not found
   Pattern 'World!$': Found
   Pattern 'Hello$': Not found

4. Character classes:
   Text: Hello123 World456 Test789!
   \d (digits): ['1', '2', '3', '4', '5', '6', '7', '8', '9']
   \d+ (digit groups): ['123', '456', '789']
   \w (word chars): ['H', 'e', 'l', 'l', 'o', '1', '2', '3', 'W', 'o']...
   \w+ (words): ['Hello123', 'World456', 'Test789']
   \s (whitespace): [' ', ' ']

5. Custom character classes:
   Text: Education is important
   [aeiou] (vowels): ['e', 'u', 'a', 'i', 'o', 'i', 'i', '
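
The captured output above is cut off before the quantifier and greedy-matching results appear. As a minimal, self-contained sketch (it reuses two sample strings from the cell above; the variable names are only illustrative), the following shows what those patterns return:

```python
import re

# Same sample strings as in the cell above
quant_text = "a aa aaa aaaa b bb bbb"
html_text = "<div>Hello</div> and <span>World</span>"

# + needs at least one 'a'; {2,3} consumes at most three at a time
one_or_more = re.findall(r"a+", quant_text)
two_to_three = re.findall(r"a{2,3}", quant_text)

# Greedy .* runs to the LAST '>', lazy .*? stops at the FIRST '>'
greedy = re.findall(r"<.*>", html_text)
lazy = re.findall(r"<.*?>", html_text)

print(one_or_more)   # ['a', 'aa', 'aaa', 'aaaa']
print(two_to_three)  # ['aa', 'aaa', 'aaa']
print(greedy)        # ['<div>Hello</div> and <span>World</span>']
print(lazy)          # ['<div>', '</div>', '<span>', '</span>']
```

Note that `a{2,3}` returns only `'aaa'` for the run of four: the leftover single `a` is too short to match on its own.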

## 2. Python's re Module Functions

Understanding the main functions in Python's re module:

In [6]:
# Python's re module functions
import re

print("Python's re Module Functions:")
print("=" * 29)

# Sample text for demonstrations
sample_text = """John Doe: 123-456-7890, john.doe@email.com
Jane Smith: (555) 123-4567, jane@company.org  
Bob Wilson: 987.654.3210, bob_wilson@test.net
Alice Brown: 444-555-6666, alice.brown@domain.co.uk"""

print(f"Sample text:")
print(f"{sample_text}")

# re.search() - find first match
print(f"\n1. re.search() - find first match:")

phone_pattern = r'\d{3}[.-]\d{3}[.-]\d{4}'
first_phone = re.search(phone_pattern, sample_text)

if first_phone:
    print(f"   First phone found: '{first_phone.group()}'")
    print(f"   Position: {first_phone.start()}-{first_phone.end()}")
    print(f"   Full match object: {first_phone}")
else:
    print(f"   No phone numbers found")

# re.match() - match at string beginning
print(f"\n2. re.match() - match at string beginning:")

name_pattern = r'\w+\s+\w+'
start_match = re.match(name_pattern, sample_text)

if start_match:
    print(f"   Match at start: '{start_match.group()}'")
else:
    print(f"   No match at string beginning")

# Test with different starting text
test_string = "Alice Johnson is here"
match_test = re.match(name_pattern, test_string)
print(f"   Testing '{test_string[:20]}...': {match_test.group() if match_test else 'No match'}")

# re.findall() - find all matches
print(f"\n3. re.findall() - find all matches:")

# Find all phone numbers
all_phones = re.findall(phone_pattern, sample_text)
print(f"   All phones: {all_phones}")

# Find all email addresses
email_pattern = r'[\w.-]+@[\w.-]+\.[\w]+'
all_emails = re.findall(email_pattern, sample_text)
print(f"   All emails: {all_emails}")

# Find all names (assuming they're at line starts)
name_pattern_full = r'^(\w+\s+\w+):'
all_names = re.findall(name_pattern_full, sample_text, re.MULTILINE)
print(f"   All names: {all_names}")

# re.finditer() - find all matches as match objects
print(f"\n4. re.finditer() - find all match objects:")

for i, match in enumerate(re.finditer(phone_pattern, sample_text), 1):
    print(f"   Phone {i}: '{match.group()}' at position {match.start()}-{match.end()}")

# re.split() - split string by pattern
print(f"\n5. re.split() - split by pattern:")

# Split by various delimiters
mixed_delim_text = "apple,banana;orange:grape|kiwi"
delim_pattern = r'[,;:|]'
fruits = re.split(delim_pattern, mixed_delim_text)
print(f"   Text: {mixed_delim_text}")
print(f"   Split by [,;:|]: {fruits}")

# Split with limit
limited_split = re.split(delim_pattern, mixed_delim_text, maxsplit=2)
print(f"   Split with limit 2: {limited_split}")

# Split by whitespace (multiple spaces/tabs)
whitespace_text = "word1    word2\tword3   word4"
whitespace_split = re.split(r'\s+', whitespace_text)
print(f"   Text: '{whitespace_text}'")
print(f"   Split by whitespace: {whitespace_split}")

# re.sub() - substitute matches
print(f"\n6. re.sub() - substitute matches:")

# Replace phone number formats
phone_text = "Call 123-456-7890 or (555) 123-4567"
print(f"   Original: {phone_text}")

# Standardize phone format
# Optional parentheses around the area code; separators may be '-', '.', or whitespace
standardized = re.sub(r'\(?(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})',
                      r'(\1) \2-\3', phone_text)
print(f"   Standardized: {standardized}")

# Replace with function
def phone_formatter(match):
    """Custom phone formatting function"""
    full_match = match.group(0)
    digits_only = re.sub(r'\D', '', full_match)
    if len(digits_only) == 10:
        return f"{digits_only[:3]}-{digits_only[3:6]}-{digits_only[6:]}"
    return full_match

custom_format = re.sub(r'\(?\d{3}\)?[-\s]?\d{3}[-\s]?\d{4}', phone_formatter, phone_text)
print(f"   Custom format: {custom_format}")

# Count replacements
sensitive_text = "This is SENSITIVE data with sensitive information"
cleaned_text, count = re.subn(r'sensitive', '[REDACTED]', sensitive_text, flags=re.IGNORECASE)
print(f"   Original: {sensitive_text}")
print(f"   Cleaned: {cleaned_text}")
print(f"   Replacements made: {count}")

# re.compile() - compile patterns for reuse
print(f"\n7. re.compile() - compile for performance:")

# Compile frequently used patterns
email_regex = re.compile(r'[\w.-]+@[\w.-]+\.[\w]+')
phone_regex = re.compile(r'\(?\d{3}\)?[-\s]?\d{3}[-\s]?\d{4}')

# Use compiled patterns
test_texts = [
    "Contact: alice@example.com, phone: 555-1234",
    "Email support@company.org or call (123) 456-7890",
    "No contact info in this text"
]

for text in test_texts:
    emails_found = email_regex.findall(text)
    phones_found = phone_regex.findall(text)
    print(f"   Text: '{text[:30]}...'")
    print(f"     Emails: {emails_found}")
    print(f"     Phones: {phones_found}")

# Pattern flags
print(f"\n8. Pattern flags:")

flag_text = """Hello World
Python Programming
REGEX Patterns"""

print(f"   Text: {repr(flag_text)}")

# IGNORECASE flag
ignore_case_matches = re.findall(r'python', flag_text, re.IGNORECASE)
print(f"   IGNORECASE 'python': {ignore_case_matches}")

# MULTILINE flag - ^ and $ match line boundaries
multiline_matches = re.findall(r'^\w+', flag_text, re.MULTILINE)
print(f"   MULTILINE '^\\w+': {multiline_matches}")

# DOTALL flag - . matches newlines
dotall_match = re.search(r'Hello.*Patterns', flag_text, re.DOTALL)
print(f"   DOTALL 'Hello.*Patterns': {'Found' if dotall_match else 'Not found'}")

# VERBOSE flag - allow comments and whitespace in pattern
verbose_pattern = re.compile(r"""
    \d{3}      # Area code
    [-.]       # Separator
    \d{3}      # Exchange
    [-.]       # Separator  
    \d{4}      # Number
    """, re.VERBOSE)

verbose_test = "Phone: 123-456-7890"
verbose_match = verbose_pattern.search(verbose_test)
print(f"   VERBOSE pattern match: {'Found' if verbose_match else 'Not found'}")

# Multiple flags
multi_flag_pattern = re.compile(r'^hello.*world', re.IGNORECASE | re.DOTALL | re.MULTILINE)
multi_test = "HELLO beautiful\nWORLD"
multi_match = multi_flag_pattern.search(multi_test)
print(f"   Multiple flags: {'Found' if multi_match else 'Not found'}")

# Performance comparison
# Note: the module-level functions cache recently compiled patterns, so the
# difference is usually small; compiling mainly skips the per-call cache lookup.
print(f"\n9. Performance: compiled vs uncompiled:")
import time

test_pattern = r'\b\w{5,}\b'  # Words 5+ characters
large_text = "Python programming is awesome and powerful " * 1000

# Uncompiled pattern (re looks the pattern up in its internal cache on each call)
start = time.perf_counter()
for _ in range(100):
    re.findall(test_pattern, large_text)
uncompiled_time = time.perf_counter() - start

# Compiled pattern
compiled_pattern = re.compile(test_pattern)
start = time.perf_counter()
for _ in range(100):
    compiled_pattern.findall(large_text)
compiled_time = time.perf_counter() - start

print(f"   Uncompiled: {uncompiled_time:.4f} seconds")
print(f"   Compiled: {compiled_time:.4f} seconds")
print(f"   Ratio (uncompiled/compiled): {uncompiled_time/compiled_time:.2f}x")

print(f"\n10. Function summary:")
print(f"   re.search(): Find first match")
print(f"   re.match(): Match at string start")
print(f"   re.findall(): Find all matches as strings")
print(f"   re.finditer(): Find all matches as match objects")
print(f"   re.split(): Split string by pattern")
print(f"   re.sub(): Replace matches")
print(f"   re.subn(): Replace matches and return count")
print(f"   re.compile(): Compile pattern for reuse")

Python's re Module Functions:
Sample text:
John Doe: 123-456-7890, john.doe@email.com
Jane Smith: (555) 123-4567, jane@company.org  
Bob Wilson: 987.654.3210, bob_wilson@test.net
Alice Brown: 444-555-6666, alice.brown@domain.co.uk

1. re.search() - find first match:
   First phone found: '123-456-7890'
   Position: 10-22
   Full match object: <re.Match object; span=(10, 22), match='123-456-7890'>

2. re.match() - match at string beginning:
   Match at start: 'John Doe'
   Testing 'Alice Johnson is her...': Alice Johnson

3. re.findall() - find all matches:
   All phones: ['123-456-7890', '987.654.3210', '444-555-6666']
   All emails: ['john.doe@email.com', 'jane@company.org', 'bob_wilson@test.net', 'alice.brown@domain.co.uk']
   All names: ['John Doe', 'Jane Smith', 'Bob Wilson', 'Alice Brown']

4. re.finditer() - find all match objects:
   Phone 1: '123-456-7890' at position 10-22
   Phone 2: '987.654.3210' at position 102-114
   Phone 3: '444-555-6666' at position 149-161

5. re.spli
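
The captured output above is truncated at the `re.split()` step. As a separate, minimal sketch, the block below contrasts where each matching function is allowed to match and then repeats the split example. `re.fullmatch` is a standard `re` function that the cell above does not use, and the phone-like sample string is made up for illustration.

```python
import re

pattern = r"\d{3}-\d{4}"
text = "Call 555-1234 now"

at_start = re.match(pattern, text)               # only tries position 0
anywhere = re.search(pattern, text)              # scans forward for a match
whole = re.fullmatch(pattern, "555-1234")        # entire string must match
partial = re.fullmatch(pattern, "555-1234 now")  # trailing text, so no match

print(at_start)           # None
print(anywhere.group())   # 555-1234
print(whole is not None)  # True
print(partial)            # None

# re.split with and without a limit (same string as in the cell above)
fruit_text = "apple,banana;orange:grape|kiwi"
fruits = re.split(r"[,;:|]", fruit_text)
limited = re.split(r"[,;:|]", fruit_text, maxsplit=2)

print(fruits)   # ['apple', 'banana', 'orange', 'grape', 'kiwi']
print(limited)  # ['apple', 'banana', 'orange:grape|kiwi']
```

For validating a whole input string, `re.fullmatch` avoids having to add `^` and `$` to the pattern by hand.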

## 3. Groups and Capturing

Using parentheses to capture parts of matches:

In [7]:
# Groups and capturing
import re
from datetime import datetime

print("Groups and Capturing:")
print("=" * 21)

# Basic groups
print("1. Basic groups with parentheses:")

# Extract parts of a phone number
phone_text = "Call me at (555) 123-4567 for more info"
phone_pattern = r'\((\d{3})\)\s(\d{3})-(\d{4})'

match = re.search(phone_pattern, phone_text)
if match:
    print(f"   Text: {phone_text}")
    print(f"   Full match: {match.group(0)}")
    print(f"   Area code: {match.group(1)}")
    print(f"   Exchange: {match.group(2)}")
    print(f"   Number: {match.group(3)}")
    print(f"   All groups: {match.groups()}")

# Extract email components
email_text = "Contact john.doe@company.co.uk for support"
email_pattern = r'([\w.-]+)@([\w.-]+)\.([\w]+)'

email_match = re.search(email_pattern, email_text)
if email_match:
    print(f"\n   Email: {email_match.group(0)}")
    print(f"   Username: {email_match.group(1)}")
    print(f"   Domain: {email_match.group(2)}")
    print(f"   TLD: {email_match.group(3)}")

# Multiple groups with findall
print(f"\n2. Multiple groups with findall:")

log_data = """2024-01-15 10:30:15 INFO User logged in
2024-01-15 10:35:22 ERROR Database connection failed
2024-01-15 10:40:08 WARNING Low disk space
2024-01-15 10:45:33 INFO User logged out"""

log_pattern = r'(\d{4}-\d{2}-\d{2})\s(\d{2}:\d{2}:\d{2})\s(\w+)\s(.+)'
log_matches = re.findall(log_pattern, log_data)

print(f"   Log entries:")
# Use time_str rather than time so the time module imported earlier is not shadowed
for date, time_str, level, message in log_matches:
    print(f"   {date} at {time_str}: [{level}] {message}")

# Named groups
print(f"\n3. Named groups (?P<name>pattern):")

# More readable group names
date_pattern = r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'
date_text = "Today's date is 2024-03-15"

date_match = re.search(date_pattern, date_text)
if date_match:
    print(f"   Text: {date_text}")
    print(f"   Year: {date_match.group('year')}")
    print(f"   Month: {date_match.group('month')}")
    print(f"   Day: {date_match.group('day')}")
    print(f"   Full date: {date_match.group(0)}")
    print(f"   Group dict: {date_match.groupdict()}")

# Complex named groups example
url_pattern = r'(?P<protocol>https?)://(?P<domain>[\w.-]+)(?P<port>:\d+)?(?P<path>/[\w/-]*)?(?P<query>\?[\w=&-]*)?'
test_urls = [
    "https://www.example.com/path/to/page?id=123&type=test",
    "http://localhost:8080/admin",
    "https://api.service.com/v1/users"
]

print(f"\n   URL parsing with named groups:")
for url in test_urls:
    match = re.search(url_pattern, url)
    if match:
        print(f"   URL: {url}")
        for name, value in match.groupdict().items():
            if value:
                print(f"     {name}: {value}")
        print()

# Non-capturing groups
print(f"4. Non-capturing groups (?:pattern):")

# Group for alternation without capturing
file_pattern_capturing = r'(\w+)\.(jpg|png|gif)'  # Captures extension
file_pattern_non_capturing = r'(\w+)\.(?:jpg|png|gif)'  # Doesn't capture extension

file_text = "images: photo.jpg, icon.png, banner.gif"

capturing_matches = re.findall(file_pattern_capturing, file_text)
non_capturing_matches = re.findall(file_pattern_non_capturing, file_text)

print(f"   Text: {file_text}")
print(f"   Capturing groups: {capturing_matches}")
print(f"   Non-capturing groups: {non_capturing_matches}")

# Backreferences
print(f"\n5. Backreferences \\1, \\2, etc.:")

# Find repeated words
repeated_word_text = "This is is a test test with repeated words words"
repeated_pattern = r'\b(\w+)\s+\1\b'

repeated_matches = re.findall(repeated_pattern, repeated_word_text)
print(f"   Text: {repeated_word_text}")
print(f"   Repeated words: {repeated_matches}")

# Find and highlight repeated words
highlighted = re.sub(repeated_pattern, r'[\1-REPEATED]', repeated_word_text)
print(f"   Highlighted: {highlighted}")

# HTML tag matching with backreferences
html_text = "<div>Content</div> and <span>More content</span> but <p>Unclosed div"
matching_tags_pattern = r'<(\w+)>([^<]*)</\1>'

matching_tags = re.findall(matching_tags_pattern, html_text)
print(f"\n   HTML: {html_text}")
print(f"   Matching tags: {matching_tags}")

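# Optional group combined with a backreference
print(f"\n6. Conditional patterns:")

# (-?) captures an optional hyphen; \1 then requires the same separator
# (a hyphen or nothing) in the second position
optional_pattern = r'\d{3}(-?)\d{3}\1\d{4}'  # Consistent separator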
phone_tests = ["123-456-7890", "1234567890", "123.456.7890", "123-456.7890"]

print(f"   Testing consistent separator pattern:")
for phone in phone_tests:
    match = re.search(optional_pattern, phone)
    result = "Match" if match else "No match"
    print(f"   {phone}: {result}")

# Lookahead and lookbehind
print(f"\n7. Lookahead and lookbehind:")

# Positive lookahead (?=pattern)
password_text = "MyPassword123"
password_with_digit = r'\w*(?=\d)\w*'  # Words containing digits

digit_match = re.search(password_with_digit, password_text)
print(f"   Text: {password_text}")
print(f"   Contains digit: {'Yes' if digit_match else 'No'}")

# Extract words followed by specific punctuation
text_with_punct = "Hello, World! How are you? Fine, thanks."
words_before_punct = re.findall(r'\w+(?=[,.!?])', text_with_punct)
print(f"   Text: {text_with_punct}")
print(f"   Words before punctuation: {words_before_punct}")

# Negative lookahead (?!pattern)
filename_text = "script.py config.txt data.csv backup.py"
# \b after 'py' rejects only the exact extension 'py', not longer ones that start with it
non_python_files = re.findall(r'\w+\.(?!py\b)\w+', filename_text)
print(f"   Files: {filename_text}")
print(f"   Non-Python files: {non_python_files}")

# Lookbehind examples
currency_text = "Price: $19.99, Cost: €25.50, Tax: £5.00"
currency_amounts = re.findall(r'(?<=[\$€£])\d+\.\d{2}', currency_text)
print(f"   Currency text: {currency_text}")
print(f"   Amounts: {currency_amounts}")

# Complex grouping example: parsing CSV-like data
print(f"\n8. Complex parsing example:")

csv_data = '''"John Doe",25,"Software Engineer","New York"
"Jane Smith",30,"Data Scientist","San Francisco"
"Bob Wilson",28,"Designer","Los Angeles"'''

# Pattern to parse CSV with quoted fields
csv_pattern = r'"([^"]*)",([^,]+),"([^"]*)","([^"]*)"'
csv_matches = re.findall(csv_pattern, csv_data)

print(f"   CSV data parsing:")
for name, age, job, city in csv_matches:
    print(f"   Name: {name}, Age: {age}, Job: {job}, City: {city}")

# Extract and transform with groups
print(f"\n9. Transform with groups:")

american_dates = "Event on 03/15/2024 and meeting on 12/25/2024"
date_transform_pattern = r'(\d{2})/(\d{2})/(\d{4})'

# Transform MM/DD/YYYY to DD-MM-YYYY
european_dates = re.sub(date_transform_pattern, r'\2-\1-\3', american_dates)
print(f"   American format: {american_dates}")
print(f"   European format: {european_dates}")

# Advanced substitution with function
def date_converter(match):
    """Convert date format and validate"""
    month, day, year = match.groups()
    try:
        # Validate date
        date_obj = datetime(int(year), int(month), int(day))
        # Return formatted date
        return date_obj.strftime("%d %B %Y")
    except ValueError:
        return match.group(0)  # Return original if invalid

readable_dates = re.sub(date_transform_pattern, date_converter, american_dates)
print(f"   Readable format: {readable_dates}")

print(f"\n10. Groups summary:")
print(f"   (pattern): Capturing group")
print(f"   (?P<name>pattern): Named capturing group")
print(f"   (?:pattern): Non-capturing group")
print(f"   \\1, \\2: Backreferences to captured groups")
print(f"   (?=pattern): Positive lookahead")
print(f"   (?!pattern): Negative lookahead")
print(f"   (?<=pattern): Positive lookbehind")
print(f"   (?<!pattern): Negative lookbehind")

Groups and Capturing:
1. Basic groups with parentheses:
   Text: Call me at (555) 123-4567 for more info
   Full match: (555) 123-4567
   Area code: 555
   Exchange: 123
   Number: 4567
   All groups: ('555', '123', '4567')

   Email: john.doe@company.co.uk
   Username: john.doe
   Domain: company.co
   TLD: uk

2. Multiple groups with findall:
   Log entries:
   2024-01-15 at 10:30:15: [INFO] User logged in
   2024-01-15 at 10:35:22: [ERROR] Database connection failed
   2024-01-15 at 10:40:08: [WARNING] Low disk space
   2024-01-15 at 10:45:33: [INFO] User logged out

3. Named groups (?P<name>pattern):
   Text: Today's date is 2024-03-15
   Year: 2024
   Month: 03
   Day: 15
   Full date: 2024-03-15
   Group dict: {'year': '2024', 'month': '03', 'day': '15'}

   URL parsing with named groups:
   URL: https://www.example.com/path/to/page?id=123&type=test
     protocol: https
     domain: www.example.com
     path: /path/to/page
     query: ?id=123&type=test

   URL: http://localhost:8080/admin
     protocol: http
     domain: localhost
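
The captured output above stops before the non-capturing group, backreference, and substitution results. The following is a minimal sketch of those behaviours, assuming shortened versions of the sample strings from the cell above. The `\g<name>` replacement syntax is standard in `re` but is not used in that cell, and the name `us_date` is only illustrative.

```python
import re

file_text = "images: photo.jpg, icon.png, banner.gif"

# Two capturing groups: findall returns (name, extension) tuples
with_ext = re.findall(r"(\w+)\.(jpg|png|gif)", file_text)
# Non-capturing alternation: only the one captured name is returned
names_only = re.findall(r"(\w+)\.(?:jpg|png|gif)", file_text)

# Backreference \1 must repeat exactly what group 1 captured
repeated = re.findall(r"\b(\w+)\s+\1\b", "This is is a test test")

# Named groups can be referenced in a replacement with \g<name>
us_date = re.compile(r"(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{4})")
iso = us_date.sub(r"\g<year>-\g<month>-\g<day>", "Event on 03/15/2024")

print(with_ext)    # [('photo', 'jpg'), ('icon', 'png'), ('banner', 'gif')]
print(names_only)  # ['photo', 'icon', 'banner']
print(repeated)    # ['is', 'test']
print(iso)         # Event on 2024-03-15
```

The leading `\b` in the repeated-word pattern keeps the `is` inside `This` from pairing with the following `is`.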

## 4. Real-World Applications

Practical regex applications for common tasks:

In [8]:
# Real-world regex applications
import re
import json
from collections import Counter

print("Real-World Regex Applications:")
print("=" * 31)

# Email validation
print("1. Email validation:")

def validate_email(email):
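    """Basic email format check (a practical pattern, not full RFC 5322 validation)"""
    # Local part, '@', domain labels, then an alphabetic TLD of 2+ letters
    # (no upper bound, so very long TLDs are accepted)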
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return re.match(pattern, email) is not None

test_emails = [
    "valid@example.com",
    "user.name+tag@domain.co.uk", 
    "invalid.email",
    "@invalid.com",
    "valid123@test-domain.org",
    "spaces @invalid.com",
    "toolongextension@example.toolongextension"
]

print(f"   Email validation results:")
for email in test_emails:
    is_valid = validate_email(email)
    status = "✓" if is_valid else "✗"
    print(f"   {status} {email}")

# Phone number normalization
print(f"\n2. Phone number normalization:")

def normalize_phone(phone):
    """Normalize various phone number formats"""
    # Extract digits only
    digits = re.sub(r'\D', '', phone)
    
    if len(digits) == 10:
        return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
    elif len(digits) == 11 and digits[0] == '1':
        return f"+1 ({digits[1:4]}) {digits[4:7]}-{digits[7:]}"
    else:
        return "Invalid phone number"

phone_formats = [
    "555-123-4567",
    "(555) 123-4567",
    "5551234567",
    "1-555-123-4567",
    "+1 (555) 123-4567",
    "555.123.4567",
    "12345"  # Invalid
]

print(f"   Phone normalization:")
for phone in phone_formats:
    normalized = normalize_phone(phone)
    print(f"   {phone:20} -> {normalized}")

# Log file analysis
print(f"\n3. Log file analysis:")

log_content = """2024-01-15 10:30:15 INFO [UserService] User 'alice' logged in from 192.168.1.100
2024-01-15 10:35:22 ERROR [DatabaseService] Connection timeout to db.example.com:5432
2024-01-15 10:40:08 WARNING [MemoryManager] Memory usage at 85%
2024-01-15 10:45:33 INFO [UserService] User 'bob' logged out
2024-01-15 10:50:45 ERROR [ApiService] HTTP 500 error on /api/users endpoint
2024-01-15 10:55:12 INFO [UserService] User 'charlie' logged in from 10.0.0.50"""

def analyze_logs(log_text):
    """Extract structured information from logs"""
    # Comprehensive log pattern
    pattern = r'(?P<timestamp>\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})\s(?P<level>\w+)\s\[(?P<service>\w+)\]\s(?P<message>.+)'
    
    matches = re.finditer(pattern, log_text)
    entries = []
    
    for match in matches:
        entry = match.groupdict()
        
        # Extract additional info from message
        if 'logged in' in entry['message']:
            user_match = re.search(r"User '(\w+)'", entry['message'])
            ip_match = re.search(r'from ([\d.]+)', entry['message'])
            entry['action'] = 'login'
            entry['user'] = user_match.group(1) if user_match else None
            entry['ip'] = ip_match.group(1) if ip_match else None
        
        elif 'logged out' in entry['message']:
            user_match = re.search(r"User '(\w+)'", entry['message'])
            entry['action'] = 'logout'
            entry['user'] = user_match.group(1) if user_match else None
        
        entries.append(entry)
    
    return entries

log_entries = analyze_logs(log_content)

print(f"   Log analysis results:")
for entry in log_entries:
    print(f"   {entry['timestamp']} [{entry['level']}] {entry['service']}: {entry['message'][:50]}...")
    if 'user' in entry and entry['user']:
        print(f"     -> User: {entry['user']}, Action: {entry.get('action', 'N/A')}")

# Count log levels
log_levels = Counter(entry['level'] for entry in log_entries)
print(f"\n   Log level summary: {dict(log_levels)}")

# Web scraping - extract links
print(f"\n4. HTML link extraction:")

html_content = """
<html>
<body>
    <a href="https://www.example.com">Example Site</a>
    <a href="/internal/page">Internal Link</a>
    <a href="mailto:contact@example.com">Email Us</a>
    <a href="https://api.service.com/v1/data">API Endpoint</a>
    <img src="image.jpg" alt="Image">
    <link rel="stylesheet" href="styles.css">
</body>
</html>
"""

def extract_links(html):
    """Extract different types of links from HTML"""
    results = {
        'external_links': [],
        'internal_links': [],
        'email_links': [],
        'images': [],
        'stylesheets': []
    }
    
    # External HTTP/HTTPS links
    external_pattern = r'<a\s+href="(https?://[^"]+)"[^>]*>([^<]+)</a>'
    results['external_links'] = re.findall(external_pattern, html)
    
    # Internal links (starting with /)
    internal_pattern = r'<a\s+href="(/[^"]+)"[^>]*>([^<]+)</a>'
    results['internal_links'] = re.findall(internal_pattern, html)
    
    # Email links
    email_pattern = r'<a\s+href="mailto:([^"]+)"[^>]*>([^<]+)</a>'
    results['email_links'] = re.findall(email_pattern, html)
    
    # Images
    img_pattern = r'<img\s+src="([^"]+)"[^>]*alt="([^"]+)"[^>]*>'
    results['images'] = re.findall(img_pattern, html)
    
    # Stylesheets
    css_pattern = r'<link\s+rel="stylesheet"\s+href="([^"]+)"[^>]*>'
    results['stylesheets'] = re.findall(css_pattern, html)
    
    return results

links = extract_links(html_content)
print(f"   Link extraction results:")
for link_type, link_list in links.items():
    if link_list:
        print(f"   {link_type.replace('_', ' ').title()}:")
        for link in link_list:
            if isinstance(link, tuple):
                print(f"     URL: {link[0]}, Text: {link[1]}")
            else:
                print(f"     {link}")

# Data cleaning and validation
print(f"\n5. Data cleaning and validation:")

def clean_and_validate_data(raw_data):
    """Clean and validate mixed data"""
    results = {
        'cleaned_data': [],
        'errors': []
    }
    
    for i, item in enumerate(raw_data):
        cleaned_item = {}
        
        # Clean name (remove extra spaces, capitalize)
        if 'name' in item:
            name = re.sub(r'\s+', ' ', item['name'].strip())
            if re.match(r'^[a-zA-Z\s]+$', name):
                cleaned_item['name'] = name.title()
            else:
                results['errors'].append(f"Row {i}: Invalid name format")
        
        # Validate and clean email
        if 'email' in item:
            email = item['email'].strip().lower()
            if re.match(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$', email):
                cleaned_item['email'] = email
            else:
                results['errors'].append(f"Row {i}: Invalid email format")
        
        # Clean phone number
        if 'phone' in item:
            phone = re.sub(r'\D', '', item['phone'])
            if len(phone) == 10:
                cleaned_item['phone'] = f"({phone[:3]}) {phone[3:6]}-{phone[6:]}"
            else:
                results['errors'].append(f"Row {i}: Invalid phone number")
        
        # Validate age
        if 'age' in item:
            age_match = re.search(r'\d+', str(item['age']))
            if age_match:
                age = int(age_match.group())
                if 0 <= age <= 120:
                    cleaned_item['age'] = age
                else:
                    results['errors'].append(f"Row {i}: Age out of range")
            else:
                results['errors'].append(f"Row {i}: Invalid age format")
        
        if cleaned_item:  # Only add if some data was cleaned successfully
            results['cleaned_data'].append(cleaned_item)
    
    return results

# Sample messy data
messy_data = [
    {'name': 'john   doe', 'email': 'JOHN@EXAMPLE.COM', 'phone': '555-123-4567', 'age': '25'},
    {'name': 'jane smith123', 'email': 'invalid-email', 'phone': '5551234567', 'age': 30},
    {'name': '  alice brown  ', 'email': 'alice@company.org', 'phone': '(555) 987-6543', 'age': '200'},
    {'name': 'bob wilson', 'email': 'bob@test.com', 'phone': '12345', 'age': 'twenty-five'}
]

cleaning_results = clean_and_validate_data(messy_data)

print(f"   Data cleaning results:")
print(f"   Cleaned records: {len(cleaning_results['cleaned_data'])}")
for record in cleaning_results['cleaned_data']:
    print(f"     {record}")

print(f"\n   Errors encountered:")
for error in cleaning_results['errors']:
    print(f"     {error}")

# Text analysis - extract mentions, hashtags, URLs, and emojis
from collections import Counter  # tallies mentions and hashtags below
print(f"\n6. Social media text analysis:")

social_media_posts = [
    "Great meeting with @john_doe and @jane_smith! #productivity #teamwork",
    "Check out this article: https://example.com/article #tech #innovation",
    "Thanks to @alice for the help with #python programming! 🐍",
    "Looking forward to #conference2024 @tech_event #networking"
]

def analyze_social_text(posts):
    """Count mentions and hashtags, and collect URLs and emojis, across all posts"""
    analysis = {
        'mentions': Counter(),
        'hashtags': Counter(),
        'urls': [],
        'emojis': []
    }
    
    for post in posts:
        # Extract mentions (@username); \w already includes the underscore
        mentions = re.findall(r'@(\w+)', post)
        analysis['mentions'].update(mentions)
        
        # Extract hashtags (#tag)
        hashtags = re.findall(r'#(\w+)', post)
        analysis['hashtags'].update(hashtags)
        
        # Extract URLs
        urls = re.findall(r'https?://[\w.-]+(?:/[\w.-]*)*', post)
        analysis['urls'].extend(urls)
        
        # Extract emojis (basic Unicode detection)
        emojis = re.findall(r'[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF\U0001F1E0-\U0001F1FF]', post)
        analysis['emojis'].extend(emojis)
    
    return analysis

social_analysis = analyze_social_text(social_media_posts)

print(f"   Social media analysis:")
print(f"   Top mentions: {dict(social_analysis['mentions'].most_common(3))}")
print(f"   Top hashtags: {dict(social_analysis['hashtags'].most_common(3))}")
print(f"   URLs found: {social_analysis['urls']}")
print(f"   Emojis: {social_analysis['emojis']}")

print(f"\n7. Regex best practices for real-world use:")
print(f"   ✓ Validate inputs before processing")
print(f"   ✓ Use specific patterns rather than overly broad ones")
print(f"   ✓ Compile patterns for repeated use")
print(f"   ✓ Handle edge cases and invalid inputs")
print(f"   ✓ Test patterns with various input formats")
print(f"   ✓ Document complex patterns with comments")
print(f"   ⚠️  Remember: regex isn't always the best tool")
print(f"   ⚠️  For HTML/XML parsing, consider dedicated parsers")
print(f"   ⚠️  Be careful with user-provided regex patterns")

Real-World Regex Applications:
1. Email validation:
   Email validation results:
   ✓ valid@example.com
   ✓ user.name+tag@domain.co.uk
   ✗ invalid.email
   ✗ @invalid.com
   ✓ valid123@test-domain.org
   ✗ spaces @invalid.com
   ✓ toolongextension@example.toolongextension

2. Phone number normalization:
   Phone normalization:
   555-123-4567         -> (555) 123-4567
   (555) 123-4567       -> (555) 123-4567
   5551234567           -> (555) 123-4567
   1-555-123-4567       -> +1 (555) 123-4567
   +1 (555) 123-4567    -> +1 (555) 123-4567
   555.123.4567         -> (555) 123-4567
   12345                -> Invalid phone number

3. Log file analysis:
   Log analysis results:
   2024-01-15 10:30:15 [INFO] UserService: User 'alice' logged in from 192.168.1.100...
     -> User: alice, Action: login
   2024-01-15 10:35:22 [ERROR] DatabaseService: Connection timeout to db.example.com:5432...
   2024-01-15 10:45:33 [INFO] UserService: User 'bob' logged out...
     -> User: bob, Action: logo

## Summary

In this notebook, you learned about:

✅ **Basic Patterns**: Metacharacters, character classes, quantifiers  
✅ **re Module Functions**: search(), findall(), sub(), compile() and their uses  
✅ **Groups and Capturing**: Parentheses, named groups, backreferences  
✅ **Advanced Features**: Lookahead/lookbehind, flags, non-capturing groups  
✅ **Real-World Applications**: Email validation, log analysis, data cleaning  
✅ **Performance**: Compiled patterns and optimization techniques  
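
As a quick refresher on the advanced features listed above, here is a minimal sketch that combines a lookahead, a non-capturing group, and the `re.VERBOSE` flag. The pattern name `PRICE_USD` and the sample text are illustrative assumptions, not part of the notebook's earlier cells.

```python
import re

# Illustrative pattern (an assumption for this recap):
# match a dollar amount only when it is followed by the word USD.
PRICE_USD = re.compile(
    r"""
    \$                  # literal dollar sign
    (\d+(?:\.\d{2})?)   # capture the amount; the cents part is a non-capturing group
    (?=\s+USD\b)        # lookahead: require whitespace plus USD without consuming it
    """,
    re.VERBOSE,
)

text = "Items: $123.45 USD, $99 EUR, $7.50 USD"
amounts = PRICE_USD.findall(text)
print(amounts)  # ['123.45', '7.50']
```

Because the lookahead does not consume characters, only the amount itself is captured, and `re.VERBOSE` lets each part of the pattern carry its own comment.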

### Key Takeaways:
1. Regular expressions are powerful for pattern matching and text processing
2. Use specific patterns rather than overly broad ones
3. Compile patterns with re.compile() for repeated use
4. Named groups make complex patterns more readable
5. Test regex patterns thoroughly with edge cases
6. Consider alternatives like dedicated parsers for complex formats
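
Takeaways 3 and 4 can be sketched together: a pattern compiled once with `re.compile()` and made readable with named groups. The log format below is modeled on the sample log lines in this notebook; the names `LOG_LINE` and `parse_log_line` are hypothetical and chosen only for this illustration.

```python
import re

# Compiled once, reused for every line. The field layout is an assumption
# based on the sample log lines shown earlier in this notebook.
LOG_LINE = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}) "
    r"\[(?P<level>[A-Z]+)\] "
    r"(?P<service>\w+): "
    r"(?P<message>.*)"
)

def parse_log_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_LINE.fullmatch(line)
    return match.groupdict() if match else None

sample = "2024-01-15 10:30:15 [INFO] UserService: User 'alice' logged in from 192.168.1.100"
parsed = parse_log_line(sample)
print(parsed["level"], parsed["service"])  # INFO UserService
print(parse_log_line("not a log line"))    # None
```

Accessing fields by name (`parsed["level"]`) keeps the calling code stable even if groups are later added to or reordered in the pattern.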

### Next Topic: 24_context_managers.ipynb
Learn about context managers and the 'with' statement for resource management.