### **Regular Expressions**
Regular expressions (regex) in Python are powerful tools for matching and manipulating text based on specific patterns. The re module in Python provides support for working with regular expressions. By defining patterns using a combination of literal characters and metacharacters, you can perform complex string matching, searching, and substitution operations efficiently. Understanding regular expressions is essential for tasks such as data validation, parsing, and text processing.

Common Regex Patterns and Rules:

Literal Characters: Match the exact characters in the pattern.

- Example: pattern = r"abc" matches the string "abc".

Character Classes: Match any character within the brackets.

- [aeiou] matches any vowel.
- [A-Z] matches any uppercase letter.
- [0-9] matches any digit.
- [^0-9] matches any character except digits.
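
As a quick illustration of the character classes above, the sketch below runs each one through re.findall. The sample string and variable names are made up for this example.

```python
import re

sample = "Regex 101: Match Me!"

# [aeiou] matches lowercase vowels only (classes are case-sensitive)
vowels = re.findall(r"[aeiou]", sample)

# [A-Z] matches uppercase ASCII letters
uppercase = re.findall(r"[A-Z]", sample)

# [0-9] matches digits; [^0-9] matches every other character
digits = re.findall(r"[0-9]", sample)
non_digits = re.findall(r"[^0-9]", sample)

print(vowels)               # ['e', 'e', 'a', 'e']
print(uppercase)            # ['R', 'M', 'M']
print(digits)               # ['1', '0', '1']
print("".join(non_digits))  # Regex : Match Me!
```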

Shorthand Character Classes:

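- \d matches any digit ([0-9]; for str patterns it also matches other Unicode digits unless re.ASCII is set).
- \D matches any non-digit character ([^0-9]).
- \w matches any word character: letters, digits, and the underscore ([a-zA-Z0-9_] with re.ASCII; Unicode letters and digits as well by default).
- \W matches any non-word character ([^a-zA-Z0-9_] with re.ASCII).
- \s matches any whitespace character ([ \t\n\r\f\v], plus other Unicode whitespace by default).
- \S matches any non-whitespace character ([^ \t\n\r\f\v]).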
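
The shorthand classes can be tried out the same way. This is a minimal sketch with an invented sample string; it also shows how the re.ASCII flag narrows \w to ASCII characters.

```python
import re

sample = "Order #42 shipped\tto café_9 on 2024-03-15"

# \d+ collects runs of digits
numbers = re.findall(r"\d+", sample)

# \w+ collects runs of word characters (letters, digits, underscore)
words = re.findall(r"\w+", sample)

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so "é" splits the word
ascii_words = re.findall(r"\w+", sample, re.ASCII)

# \s matches single whitespace characters, including the tab
gaps = re.findall(r"\s", sample)

print(numbers)      # ['42', '9', '2024', '03', '15']
print(words)
print(ascii_words)
print(len(gaps))    # 6
```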

Anchors:

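- ^ asserts the start of the string (or the start of each line when re.MULTILINE is set).
- $ asserts the end of the string, or the position just before a trailing newline (or the end of each line when re.MULTILINE is set).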
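
The sketch below, using a made-up multi-line log string, shows how ^ and $ behave with and without the re.MULTILINE flag.

```python
import re

log = "error: disk full\nwarning: low memory\nerror: timeout"

# Without flags, ^ matches only at the very start of the string
first_only = re.findall(r"^error", log)

# With re.MULTILINE, ^ and $ also match at the start and end of each line
every_line = re.findall(r"^error", log, re.MULTILINE)
last_words = re.findall(r"\w+$", log, re.MULTILINE)

print(first_only)  # ['error']
print(every_line)  # ['error', 'error']
print(last_words)  # ['full', 'memory', 'timeout']
```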

Quantifiers:

- * matches zero or more occurrences of the preceding element.
- + matches one or more occurrences.
- ? matches zero or one occurrence.
- {n} matches exactly n occurrences.
- {n,} matches n or more occurrences.
- {n,m} matches between n and m occurrences.
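
A short sketch of the quantifiers in action follows. The input strings are invented for illustration; the last part contrasts the default greedy behaviour with the lazy form obtained by appending ? to a quantifier.

```python
import re

# ? makes the preceding "u" optional
spellings = re.findall(r"colou?r", "color colour colr")

# {n,m} bounds the repetition count: here, numbers with 2 to 4 digits
runs = re.findall(r"\b\d{2,4}\b", "7 42 123 2024 31415")

# Quantifiers are greedy by default; a trailing ? makes them lazy
html = "<b>one</b><b>two</b>"
greedy = re.findall(r"<b>.*</b>", html)
lazy = re.findall(r"<b>.*?</b>", html)

print(spellings)  # ['color', 'colour']
print(runs)       # ['42', '123', '2024']
print(greedy)     # ['<b>one</b><b>two</b>']
print(lazy)       # ['<b>one</b>', '<b>two</b>']
```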

Grouping and Alternation:

- Parentheses () group patterns together.
- The pipe | denotes alternation (i.e., logical OR).

Escaping Special Characters:

- Use \ to escape special characters like ., *, ?, etc.
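
Grouping, alternation, and escaping can be sketched together as follows. The strings are illustrative only, and re.escape is shown as one convenient way to escape a whole literal rather than the only approach.

```python
import re

# Parentheses limit the scope of |, so this matches "gray" or "grey".
# (?:...) groups without capturing, so findall returns the whole match.
colours = re.findall(r"gr(?:a|e)y", "gray grey groy")

# Unescaped, . matches any character; \. matches only a literal dot
loose = re.findall(r"3.14", "3.14 3514")
strict = re.findall(r"3\.14", "3.14 3514")

# re.escape escapes the metacharacters in a literal string for you
literal = "1+1=2?"
found = re.search(re.escape(literal), "Is 1+1=2? Yes.")

print(colours)        # ['gray', 'grey']
print(loose)          # ['3.14', '3514']
print(strict)         # ['3.14']
print(found.group())  # 1+1=2?
```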

In [2]:
"""
Objective: Use a regular expression to check if a string contains the word 'Python'.
"""
import re

text = "I am learning Python programming."

# Define the pattern to search for
pattern = r"Python"

# Search for the pattern in the text
match = re.search(pattern, text)

if match:
    print("Match found!")
else:
    print("No match found.")

# TODO: Modify the pattern to make the search case-insensitive.

match2 = re.search(pattern, text, re.IGNORECASE)
if match2:
    print("Match found!")
else:
    print("No match found.")


Match found!
Match found!


In [3]:
"""
Objective: Use a regular expression to find all email addresses in a given text.
"""
import re

text = """
Please contact us at support@example.com for further information.
You can also reach out to sales@example.org or admin@example.net.
"""

# Define the email pattern
pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"

# Find all matches in the text
emails = re.findall(pattern, text)

print("Extracted email addresses:", emails)

# TODO: Modify the pattern to exclude email addresses with certain domains (e.g., example.org).
pattern2 = r"[a-zA-Z0-9._%+-]+@(?!example\.org)[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
emails2 = re.findall(pattern2, text)

print("Extracted email addresses:", emails2)




Extracted email addresses: ['support@example.com', 'sales@example.org', 'admin@example.net']
Extracted email addresses: ['support@example.com', 'admin@example.net']


In [4]:
"""
Objective: Write a regular expression to validate phone numbers in the format (123) 456-7890.
"""
import re

phone_numbers = [
    "(123) 456-7890",
    "(987) 654-3210",
    "123-456-7890",
    "(123)456-7890"
]

# Define the phone number pattern
pattern = r"\(\d{3}\) \d{3}-\d{4}"

for number in phone_numbers:
    if re.fullmatch(pattern, number):  # fullmatch also rejects trailing characters
        print(f"{number} is a valid phone number.")
    else:
        print(f"{number} is not a valid phone number.")

# TODO: Extend the pattern to match phone numbers with an optional country code, e.g., +1 (123) 456-7890.
import re

phone_numbers = [
    "(123) 456-7890",
    "(987) 654-3210",
    "123-456-7890", 
    "(123)456-7890",
    "+1 (123) 456-7890",  # Added test case with country code
    "1 (123) 456-7890"    # Added test case without plus
]

# Define the phone number pattern with optional country code
pattern = r"(?:\+?1 )?\(\d{3}\) \d{3}-\d{4}"

for number in phone_numbers:
    if re.fullmatch(pattern, number):  # fullmatch also rejects trailing characters
        print(f"{number} is a valid phone number.")
    else:
        print(f"{number} is not a valid phone number.")


(123) 456-7890 is a valid phone number.
(987) 654-3210 is a valid phone number.
123-456-7890 is not a valid phone number.
(123)456-7890 is not a valid phone number.
(123) 456-7890 is a valid phone number.
(987) 654-3210 is a valid phone number.
123-456-7890 is not a valid phone number.
(123)456-7890 is not a valid phone number.
+1 (123) 456-7890 is a valid phone number.
1 (123) 456-7890 is a valid phone number.


In [None]:
"""
Objective: Use regular expressions to split a string by commas, semicolons, or spaces.
"""
import re

text = "apple, banana; orange grape,pear;melon"

# Define the pattern for delimiters
pattern = r"[,\s;]+"

# Split the text based on the pattern
fruits = re.split(pattern, text)

print("List of fruits:", fruits)

# TODO: Modify the pattern to also split on colons (:) and periods (.).

# Define the pattern for delimiters
pattern = r"[,\s;:.]+"

# Split the text based on the pattern
fruits = re.split(pattern, text)

print("List of fruits:", fruits)

List of fruits: ['apple', 'banana', 'orange', 'grape', 'pear', 'melon']
List of fruits: ['apple', 'banana', 'orange', 'grape', 'pear', 'melon']


In [7]:
"""
Objective: Use regular expressions to replace all occurrences of 'cat' with 'dog' in a given text.
"""
import re

text = "The cat sat on the mat. The cat is cute."

# Define the pattern to search for
pattern = r"cat"

# Replace 'cat' with 'dog'
new_text = re.sub(pattern, "dog", text)

print("Updated text:", new_text)

# TODO: Modify the pattern to replace 'cat' only when it appears as a whole word.

pattern2 = r"\bcat\b"

new_text2 = re.sub(pattern2, "dog", text)

print("Updated text:", new_text2)



Updated text: The dog sat on the mat. The dog is cute.
Updated text: The dog sat on the mat. The dog is cute.


In [None]:
"""
Objective: Write a regular expression to extract dates in the format DD/MM/YYYY from a text.
"""
import re

text = """
John's birthday is on 12/05/1990.
The project deadline is 30/09/2025.
"""

# Define the date pattern
pattern = r"\b\d{2}/\d{2}/\d{4}\b"

# Find all dates in the text
dates = re.findall(pattern, text)

print("Extracted dates:", dates)

# TODO: Modify the pattern to extract dates in the format YYYY-MM-DD as well.
pattern2 = r"\b(?:\d{2}/\d{2}/\d{4}|\d{4}-\d{2}-\d{2})\b"  # non-capturing group, so findall returns whole matches


# Find all dates in the text
dates2 = re.findall(pattern2, text)

print("Extracted dates:", dates2)

Extracted dates: ['12/05/1990', '30/09/2025']
Extracted dates: ['12/05/1990', '30/09/2025']


In [13]:
"""
Objective: Use regular expression groups to extract the area code and main number from phone numbers.
"""
import re

phone_number = "(123) 456-7890"

# Define the pattern with groups
pattern = r"\((\d{3})\) (\d{3}-\d{4})"

# Search for the pattern in the phone number
match = re.search(pattern, phone_number)

if match:
    area_code = match.group(1)
    main_number = match.group(2)
    print("Area Code:", area_code)
    print("Main Number:", main_number)
else:
    print("No match found.")

# TODO: Modify the pattern to handle phone numbers with or without parentheses around the area code.
phone_numbers2 = [
    "(021) 437-4909",
    "021-437-4909"  # Added test case without parentheses
]
print('-------------------------')
pattern2 = r"(?:\((\d{3})\)|\b(\d{3}))\s*-?(\d{3}-\d{4})"

for phone_number2 in phone_numbers2:
    match = re.search(pattern2, phone_number2)
    if match:
        # Get area code from either group 1 (with parentheses) or group 2 (without)
        area_code = match.group(1) if match.group(1) else match.group(2)
        main_number = match.group(3)
        print(f"Phone: {phone_number2}")
        print(f"Area Code: {area_code}")
        print(f"Main Number: {main_number}\n")
    else:
        print(f"No match found for {phone_number2}\n")

Area Code: 123
Main Number: 456-7890
-------------------------
Phone: (021) 437-4909
Area Code: 021
Main Number: 437-4909

Phone: 021-437-4909
Area Code: 021
Main Number: 437-4909



In [15]:
"""
Objective: Compile a regular expression pattern for repeated use to improve efficiency.
"""
import re

texts = [
    "Error: File not found.",
    "Warning: Low disk space.",
    "Error: Access denied."
]

# Compile the pattern
pattern = re.compile(r"Error: (.+)")

for text in texts:
    match = pattern.search(text)
    if match:
        print("Error message:", match.group(1))

# TODO: Add a pattern to also capture 'Warning' messages.
print('-------------------------')
pattern = re.compile(r"(Error|Warning): (.+)")

for text in texts:
    match = pattern.search(text)
    if match:
        print(f"{match.group(1)} message: {match.group(2)}")


Error message: File not found.
Error message: Access denied.
-------------------------
Error message: File not found.
Warning message: Low disk space.
Error message: Access denied.


In [16]:
"""
Objective: Use lookahead and lookbehind assertions to find numbers that are preceded or followed by specific characters.
"""
import re

text = "The price is $100. The discount is 20%."

# Define the pattern with a lookbehind: digits preceded by '$'
pattern = r"(?<=\$)\d+"

# Find all matches in the text
prices = re.findall(pattern, text)

print("Prices found:", prices)

# TODO: Modify the pattern to also find percentages (numbers followed by '%').
print('--------------------')
pattern = r"(?<=\$)\d+|\d+(?=%)"

# Find all matches in the text
numbers = re.findall(pattern, text)

print("Numbers found:", numbers)


Prices found: ['100']
--------------------
Numbers found: ['100', '20']


In [17]:
"""
Objective: Write a regular expression to remove HTML tags from a string.
"""
import re

html = "<p>This is a <b>bold</b> paragraph.</p>"

# Define the pattern to match HTML tags
pattern = r"<.*?>"

# Remove HTML tags
clean_text = re.sub(pattern, "", html)

print("Cleaned text:", clean_text)

# TODO: Modify the pattern to handle nested tags correctly.
print('----------------------')
pattern = r"<[^>]*>"  # one alternative covers opening and closing tags; [^>]* cannot run past the tag's own '>'

# Remove HTML tags
clean_text = re.sub(pattern, "", html)

print("Cleaned text:", clean_text)


Cleaned text: This is a bold paragraph.
----------------------
Cleaned text: This is a bold paragraph.


### **Reflection**
Reflect on how regular expressions can simplify complex string processing tasks. Consider these questions:

- How do regular expressions improve the efficiency of text searching and manipulation?
- What are the potential pitfalls of using overly complex regular expressions?
- How can you ensure that your regular expressions are both efficient and maintainable?


- How do regular expressions improve the efficiency of text searching and manipulation?
  Regular expressions significantly improve text processing efficiency by:
  - Providing a single, powerful pattern matching operation instead of multiple string operations
  - Allowing complex search patterns to be expressed concisely
  - Enabling fast matching through optimized pattern compilation
  - Reducing the need for multiple passes through text with combined patterns
  - Supporting flexible pattern matching with metacharacters and quantifiers

- What are the potential pitfalls of using overly complex regular expressions?
  Some common pitfalls include:
  - Reduced readability and maintainability of complex patterns
  - Catastrophic backtracking leading to performance issues
  - Unintended matches due to greedy quantifiers
  - Difficulty in debugging complex patterns
  - Increased likelihood of errors in pattern syntax
  
- How can you ensure that your regular expressions are both efficient and maintainable?
  Best practices for efficient and maintainable regex:
  - Break complex patterns into smaller, well-documented components
  - Use named groups for better readability
  - Test patterns with diverse input cases
  - Comment complex patterns to explain their purpose
  - Use non-greedy quantifiers when appropriate
  - Compile frequently used patterns with re.compile (the re module also caches recent patterns, so the gain is usually modest)
  - Validate patterns against edge cases
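
Several of these practices can be combined in one place. The sketch below is only one possible layout: it compiles a pattern once, names its groups, and uses the re.VERBOSE flag so the pattern can carry its own comments. The name DATE_PATTERN and the sample sentence are hypothetical.

```python
import re

# re.VERBOSE ignores whitespace in the pattern and allows # comments inside it
DATE_PATTERN = re.compile(
    r"""
    \b
    (?P<year>\d{4})   # four-digit year
    -
    (?P<month>\d{2})  # two-digit month
    -
    (?P<day>\d{2})    # two-digit day
    \b
    """,
    re.VERBOSE,
)

match = DATE_PATTERN.search("Released on 2024-03-15.")
if match:
    parts = match.groupdict()  # named groups as a dictionary
    print(parts)               # {'year': '2024', 'month': '03', 'day': '15'}
```

Named groups keep the calling code readable (match.group('year') instead of match.group(1)), and the inline comments document each component where it is defined.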

### **Exploration**
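For further exploration, research advanced regular expression features such as named groups, non-capturing groups, and recursive patterns (the latter are not supported by Python's built-in re module; the third-party regex package offers them). Additionally, explore regex performance optimization techniques.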

Here's an example demonstrating advanced regex features and optimization techniques:


In [None]:
"""
Advanced Regular Expression Examples
"""
import re

# Example 1: Named Groups
log_entry = "2024-03-15 14:30:45 - User 'john_doe' logged in from 192.168.1.100"
pattern = r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) - User '(?P<username>\w+)' logged in from (?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"

match = re.search(pattern, log_entry)
if match:
    print(f"Date: {match.group('date')}")
    print(f"Time: {match.group('time')}")
    print(f"Username: {match.group('username')}")
    print(f"IP: {match.group('ip')}")

# Example 2: Non-capturing Groups
text = "rgb(255, 128, 0) and rgba(255, 128, 0, 0.5)"
pattern = r"rgb(?:a)?\((\d+),\s*(\d+),\s*(\d+)(?:,\s*[\d.]+)?\)"

for match in re.finditer(pattern, text):
    print(f"Color values: R={match.group(1)}, G={match.group(2)}, B={match.group(3)}")

# Example 3: Performance Optimization - Compiled Pattern
phone_pattern = re.compile(r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b")

phone_numbers = [
    "123-456-7890",
    "123.456.7890",
    "1234567890"
]

for number in phone_numbers:
    if phone_pattern.match(number):
        print(f"Valid phone number: {number}")

# Example 4: Lookahead for whole-word matching
# (this pattern uses a lookahead, not an atomic group; atomic groups, written (?>...), require Python 3.11+)
text = "The quick brown fox jumps over the lazy dog"
pattern = re.compile(r"\b\w+(?=\s|$)")  # a word followed by whitespace or the end of the string

for word in pattern.finditer(text):
    print(f"Found word: {word.group()}")

This code demonstrates:

1. Named groups ((?P<name>...)) for readable group references
2. Non-capturing groups ((?:...)) for grouping without storing a capture
3. Pattern compilation for repeated use
4. Word matching with a word boundary and a lookahead

These techniques help write more maintainable and efficient regex patterns.
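
For readers who want to see an actual atomic group, here is a minimal sketch. It assumes Python 3.11 or newer for the (?>...) syntax and guards the call with a version check so the cell still runs on older interpreters; the input string is chosen only to make the difference visible.

```python
import re
import sys

text = "100"

# Ordinary quantifier: \d+ first takes "100", then gives back one digit
# so that the final literal 0 can match
normal_match = re.search(r"\d+0", text)

# Atomic group: (?>\d+) never gives back digits it has matched,
# so the trailing literal 0 has nothing left to match
atomic_match = None
if sys.version_info >= (3, 11):  # (?>...) raises re.error on older versions
    atomic_match = re.search(r"(?>\d+)0", text)

print("normal:", normal_match.group() if normal_match else None)  # normal: 100
print("atomic:", atomic_match.group() if atomic_match else None)  # atomic: None
```

Because an atomic group refuses to backtrack, it can cut off the repeated retries that cause catastrophic backtracking, at the cost of rejecting some strings an ordinary group would accept.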