# Regular Expressions Exercises

## Overview

This notebook contains hands-on exercises to practice and reinforce your understanding of Regular Expressions (Regex). Through these exercises, you'll apply regex patterns to solve real-world text processing problems, from basic character matching to complex pattern extraction. These exercises will help you build the practical skills needed for data cleaning, preprocessing, and feature extraction in NLP pipelines.

## Objectives

- Practice writing regex patterns for common text matching scenarios
- Apply character classes, quantifiers, and anchors to solve problems
- Extract and manipulate text using regex groups and capturing
- Translate natural language requirements into regex patterns
- Debug and refine regex patterns using practical examples

## Outline

1. **Exercise Set 1: The Basics** - Character classes, quantifiers, word boundaries, matching functions, and groups
2. **Exercise Set 2: Processing in Pandas** - Applying regex to DataFrame text columns
3. **Exercise Set 3: Organizing the Messy Exercise Log** - Extracting structured data from free-form text

In [None]:
# %pip install regex==2024.5.15 pandas==2.3.3 --quiet

In [None]:
# Standard library imports
# (none needed for this notebook)

# Third-party imports
import regex as re
import pandas as pd

## Exercise Set 1: The Basics

### Exercise 1: Character Class Basics (solved)

Write a regex pattern to match:
1. Any vowel (a, e, i, o, u) in a string
2. Any character that is NOT a digit (0-9)

In [None]:
# Exercise 1: Worked solution (use it as a template for the exercises that follow)
texts = ["Hello World 123", "Python 3.9", "Regex is fun!"]

# Pattern 1: Match any vowel
pattern1 = r"[aeiouAEIOU]"
print("Vowels found:")
for text in texts:
    matches = re.findall(pattern1, text)
    print(f"  {text}: {matches}")

# Pattern 2: Match any non-digit character
pattern2 = r"[^0-9]"
print("\nNon-digit characters found:")
for text in texts:
    matches = re.findall(pattern2, text)
    print(f"  {text}: {matches[:10]}...")  # Show first 10 matches

### Exercise 2: Character Ranges

Create patterns to match:
1. All lowercase letters from 'a' to 'm'
2. All digits from 5 to 9
3. All uppercase letters from 'N' to 'Z'

In [None]:
# Test data
text = "The Quick Brown Fox Jumps Over 123 Lazy Dogs"

# Pattern 1: Lowercase a-m
### YOUR CODE HERE ###

# Pattern 2: Digits 5-9
### YOUR CODE HERE ###

# Pattern 3: Uppercase N-Z
### YOUR CODE HERE ###

### Exercise 3: Using \d, \w, and \s

Write patterns to:
1. Extract all phone numbers (sequences of digits) from a text
2. Find all words that start with 't' or 'T'
3. Extract all sequences of whitespace
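
As a warm-up, here is a minimal sketch of the three shorthand classes on a different sentence than the exercise data. The sample string and variable names are illustrative assumptions, and the import falls back to the standard-library `re` module in case the `regex` package is not installed.

```python
try:
    import regex as re  # the package this notebook imports
except ImportError:
    import re  # standard-library fallback; these calls behave the same

# Hypothetical sample text, not the exercise data
sample = "Order 66 shipped  to Tatooine\ton day 7"

digit_runs = re.findall(r"\d+", sample)        # \d+ : one or more digits
s_words = re.findall(r"\b[sS]\w*", sample)     # words starting with s or S
whitespace_runs = re.findall(r"\s+", sample)   # \s+ : spaces, tabs, newlines

print(digit_runs)       # ['66', '7']
print(s_words)          # ['shipped']
print(whitespace_runs)  # includes a double space and a tab
```

Note that `\s+` groups consecutive whitespace into a single match, so the double space comes back as one item.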

In [None]:
# Test data
text = "Call me at 123-456-7890 or 9876543210 today!"

# Pattern 1: Extract phone numbers (sequences of digits)
### YOUR CODE HERE ###

# Pattern 2: Words starting with 't' or 'T'
### YOUR CODE HERE ###

# Pattern 3: Sequences of whitespace
### YOUR CODE HERE ###

### Exercise 4: Word Boundaries

Create patterns to:
1. Match the word "the" only as a complete word (not part of "there" or "other")
2. Find all words that end with "ing"
3. Match "cat" but not "category" or "scatter"
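
The effect of `\b` is easiest to see by comparing a pattern with and without it. The sketch below uses made-up sentences (an assumption, not the exercise text) and falls back to the standard-library `re` module if `regex` is unavailable.

```python
try:
    import regex as re  # the package this notebook imports
except ImportError:
    import re  # standard-library fallback; these calls behave the same

# Hypothetical sample text, not the exercise data
sample = "An artist made art at the party; art matters."

loose = re.findall(r"art", sample)        # also hits "artist" and "party"
whole = re.findall(r"\bart\b", sample)    # only the standalone word

# A suffix match: \w+ consumes the stem, \b pins the suffix to the word end
adverbs = re.findall(r"\b\w+ly\b", "She quickly and quietly left the lyric sheet.")

print(len(loose), whole)  # 4 ['art', 'art']
print(adverbs)            # ['quickly', 'quietly']
```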

In [None]:
# Test data
text = "The cat is scattering. There is nothing interesting about this category."

# Pattern 1: Match "the" as complete word
### YOUR CODE HERE ###

# Pattern 2: Words ending with "ing"
### YOUR CODE HERE ###

# Pattern 3: Match "cat" but not "category" or "scatter"
### YOUR CODE HERE ###

### Exercise 5: Quantifiers

Write patterns to match:
1. Exactly 3 consecutive digits
2. One or more letters followed by zero or more digits
3. Between 2 and 4 consecutive vowels
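
One subtlety worth seeing before you start: a counted quantifier such as `{2}` also matches inside longer runs unless something stops it. The following sketch, on illustrative strings that are not the exercise data, contrasts the two behaviours; `re` falls back to the standard library if `regex` is not installed.

```python
try:
    import regex as re  # the package this notebook imports
except ImportError:
    import re  # standard-library fallback; these calls behave the same

# Hypothetical sample text, not the exercise data
sample = "id7 ab12 x 2024 99 zoo"

inside_runs = re.findall(r"\d{2}", sample)       # pairs found anywhere, even inside 2024
whole_tokens = re.findall(r"\b\d{2}\b", sample)  # only stand-alone two-digit tokens
letters_then_digits = re.findall(r"[a-z]+\d*", sample)  # + needs one letter, * allows no digits

# {m,n} bounds the repetition; \b keeps longer runs from matching partially
bounded = re.findall(r"\bwo{2,3}w\b", "wow woow wooow woooow")

print(inside_runs)          # ['12', '20', '24', '99']
print(whole_tokens)         # ['99']
print(letters_then_digits)  # ['id7', 'ab12', 'x', 'zoo']
print(bounded)              # ['woow', 'wooow']
```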

In [None]:
# Test data
text = "abc123 def4567 ghi jklmnop 12345 aeiou"

# Pattern 1: Exactly 3 consecutive digits
### YOUR CODE HERE ###

# Pattern 2: One or more letters followed by zero or more digits
### YOUR CODE HERE ###

# Pattern 3: Between 2 and 4 consecutive vowels
### YOUR CODE HERE ###

### Exercise 6: Optional and Repetition

Create patterns to:
1. Match decimal numbers (with optional decimal part)
2. Match email-like patterns (word@word.word format)
3. Match words that may or may not have an 's' at the end
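
The `?` quantifier makes the preceding character (or group) optional. A small sketch on unrelated strings, with illustrative variable names and a standard-library fallback for the import:

```python
try:
    import regex as re  # the package this notebook imports
except ImportError:
    import re  # standard-library fallback; these calls behave the same

# Optional single character: the "u" may or may not be present
spellings = re.findall(r"colou?r", "color and colour")

# Optional leading sign before one or more digits
numbers = re.findall(r"-?\d+", "lows of -5, 12 and -40")

# Optional literal dot: it must be escaped, otherwise "." matches any character
titles = re.findall(r"Mr\.? \w+", "Mr Smith met Mr. Jones")

print(spellings)  # ['color', 'colour']
print(numbers)    # ['-5', '12', '-40']
print(titles)     # ['Mr Smith', 'Mr. Jones']
```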

In [None]:
# Test data
text = "Price is 99.99 or 100. Contact admin@site.com or users@example.org for help."

# Pattern 1: Decimal numbers (optional decimal part)
### YOUR CODE HERE ###

# Pattern 2: Email-like patterns
### YOUR CODE HERE ###

# Pattern 3: Words with optional 's' at the end
### YOUR CODE HERE ###

### Exercise 7: Using match(), search(), findall(), and finditer()

1. Use `match()` to check if the string starts with "Python"
2. Use `search()` to find the first occurrence of a version number
3. Use `findall()` to extract all version numbers
4. Use `finditer()` to get Match objects with positions for all "Python" occurrences
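
Before solving the exercise, it may help to see how the four functions differ on the same input. This is a sketch on a made-up log line (an assumption, not the exercise text); the import falls back to the standard-library `re` module if `regex` is not installed.

```python
try:
    import regex as re  # the package this notebook imports
except ImportError:
    import re  # standard-library fallback; these calls behave the same

# Hypothetical sample text, not the exercise data
sample = "error 404 at 10:15, error 500 at 10:20"

starts_with_error = re.match(r"error", sample) is not None  # match() only tries position 0
starts_with_digit = re.match(r"\d+", sample) is not None    # digits exist, but not at the start

first_code = re.search(r"\d{3}", sample).group()            # first hit anywhere in the string
all_codes = re.findall(r"\b\d{3}\b", sample)                # list of strings, no positions
error_spans = [m.span() for m in re.finditer(r"error", sample)]  # Match objects carry positions

print(starts_with_error, starts_with_digit)  # True False
print(first_code)                            # 404
print(all_codes)                             # ['404', '500']
print(error_spans)                           # [(0, 5), (20, 25)]
```

Both `match()` and `search()` return `None` when nothing matches, so check the result before calling `.group()` on real data.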

In [None]:
# Test data
text = "Python 3.10 and Python 2.7 are versions."

# 1. Check if string starts with "Python"
### YOUR CODE HERE ###

# 2. Find first version number
### YOUR CODE HERE ###

# 3. Extract all version numbers
### YOUR CODE HERE ###

# 4. Finditer for "Python" with positions
### YOUR CODE HERE ###

### Exercise 8: Match Object Attributes

Extract information from matches:
1. Find all numbers in the text
2. For each match, print: the matched text, start position, end position, and span
3. Count how many numbers are in the text

In [None]:
# Test data
text = "I have 5 apples and 12 oranges"

# Pattern to find numbers
### YOUR CODE HERE ###

# Extract match information
### YOUR CODE HERE ###

### Exercise 9: Extracting Groups

Extract components from:
1. Dates in format "YYYY-MM-DD" - extract year, month, and day separately
2. Time in format "HH:MM:SS" - extract hours, minutes, and seconds
3. Phone numbers in format "(XXX) XXX-XXXX" - extract area code, exchange, and number

In [None]:
# Test data
text = "Meeting on 2024-03-15 at 14:30:00. Call (555) 123-4567"

# 1. Extract date components (year, month, day)
### YOUR CODE HERE ###

# 2. Extract time components (hours, minutes, seconds)
### YOUR CODE HERE ###

# 3. Extract phone number components (area code, exchange, number)
### YOUR CODE HERE ###

### Exercise 10: Nested Groups

Extract information from nested groups:
1. Parse product name and price separately
2. Parse name and age
3. Understand group numbering in nested patterns like "((word) number)"
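
Group numbers are assigned by the position of each opening parenthesis, counted from left to right, so an outer group always gets a lower number than the groups nested inside it. A minimal sketch of the `((word) number)` shape on an illustrative string (not the exercise data), with a standard-library fallback for the import:

```python
try:
    import regex as re  # the package this notebook imports
except ImportError:
    import re  # standard-library fallback; these calls behave the same

#            1 2     3        <- group numbers follow the opening parentheses
pattern = r"((\w+) (\d+))"
m = re.search(pattern, "see item 42 here")

print(m.group(0))  # 'item 42'  (the whole match)
print(m.group(1))  # 'item 42'  (outer group)
print(m.group(2))  # 'item'     (first inner group)
print(m.group(3))  # '42'       (second inner group)
print(m.groups())  # ('item 42', 'item', '42')
```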

In [None]:
# Test data
text = "Product: Laptop, Price: $999.99 and Name: John Doe, Age: 30"

# 1. Extract product and price
### YOUR CODE HERE ###

# 2. Extract name and age (with nested groups to capture first and last name separately)
### YOUR CODE HERE ###

### Exercise 11: Named Groups

Use named groups to extract:
1. Email addresses - extract username and domain separately
2. URLs - extract protocol, domain, and path
3. Full names - extract first name, middle name (if present), and last name
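
Named groups use the `(?P<name>...)` syntax and are read back with `m.group("name")`. When a named group sits inside an optional part of the pattern that did not take part in the match, its value is `None`. The sketch below shows this on version strings, which are an illustrative assumption rather than the exercise data.

```python
try:
    import regex as re  # the package this notebook imports
except ImportError:
    import re  # standard-library fallback; these calls behave the same

# The patch component is wrapped in an optional non-capturing group (?: ... )?
version_pattern = r"(?P<major>\d+)\.(?P<minor>\d+)(?:\.(?P<patch>\d+))?"

full = re.search(version_pattern, "release v2.0.7")
short = re.search(version_pattern, "release v1.4")

print(full.group("major"), full.group("minor"), full.group("patch"))     # 2 0 7
print(short.group("major"), short.group("minor"), short.group("patch"))  # 1 4 None
```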

In [None]:
# Test data
text = "Contact admin@example.com or visit https://www.example.com/path for info. Name: John Michael Doe"

# 1. Extract email with named groups
### YOUR CODE HERE ###

# 2. Extract URL components
### YOUR CODE HERE ###

# 3. Extract full name (with optional middle name)
### YOUR CODE HERE ###

### Exercise 12: Named Groups Practice

Extract structured data using named groups:
1. Parse date and time components with named groups
2. Parse IP address (4 octets) and port
3. Use groupdict() to get all named groups as a dictionary
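
`groupdict()` returns a plain dictionary keyed by group name, which makes it convenient to collect one record per match. A short sketch on a made-up settings string (the pattern and names are assumptions, not the exercise data):

```python
try:
    import regex as re  # the package this notebook imports
except ImportError:
    import re  # standard-library fallback; these calls behave the same

setting_pattern = r"(?P<key>[a-z_]+)=(?P<value>\d+)"
config_line = "retries=3 timeout=30 verbose=yes"  # "verbose=yes" has no numeric value

# One dictionary for the first match
first = re.search(setting_pattern, config_line).groupdict()

# One dictionary per match; a list like this can be passed straight to pd.DataFrame
records = [m.groupdict() for m in re.finditer(setting_pattern, config_line)]

print(first)    # {'key': 'retries', 'value': '3'}
print(records)  # [{'key': 'retries', 'value': '3'}, {'key': 'timeout', 'value': '30'}]
```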

In [None]:
# Test data
text = "Date: 2024-03-15, Time: 14:30 and IP: 192.168.1.1, Port: 8080"

# 1. Parse date and time with named groups
### YOUR CODE HERE ###

# 2. Parse IP and port
### YOUR CODE HERE ###

## Exercise Set 2: Processing in Pandas

In [None]:
import pandas as pd

### Exercise 1: Extracting Numerical Values from Text

A common task in data cleaning is extracting numerical values from text columns. For example, you might have ratings written as "3/10" or "8 out of 10", or prices mixed with text.

**Task**: Extract numerical values from a DataFrame column containing mixed text and numbers.

In [None]:
# Sample data: Product reviews with ratings in various formats
df = pd.DataFrame({
    'product': ['Laptop', 'Phone', 'Tablet', 'Headphones', 'Keyboard'],
    'review_text': [
        'Great product! Rating: 4/5',
        'Love it! 9 out of 10',
        'Average quality. Score: 3/10',
        'Excellent! Rated 5 stars',
        'Not bad, 7.5/10'
    ]
})

print("Original DataFrame:")
print(df)
print("\n" + "="*50 + "\n")

# Task 1: Extract the first numerical rating (e.g., "4" from "4/5", "9" from "9 out of 10")
# Hint: Look for patterns like "number/number" or "number out of number"
### YOUR CODE HERE ###

# Task 2: Extract both numerator and denominator from "X/Y" format
# Create new columns: 'rating_numerator' and 'rating_denominator'
### YOUR CODE HERE ###

# Task 3: Extract decimal ratings (e.g., "7.5" from "7.5/10")
### YOUR CODE HERE ###

### Exercise 2: Cleaning and Standardizing Text Data

Real-world data often contains inconsistent formatting. You need to clean phone numbers, emails, or other structured data that appears in various formats.

**Task**: Clean and standardize phone numbers and email addresses in a DataFrame.

In [None]:
# Sample data: Customer contact information with inconsistent formatting
df = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'contact_info': [
        'Phone: (555) 123-4567, Email: john@example.com',
        'Call me at 555.123.4567 or email: jane.doe@test.org',
        'Contact: 5551234567, jane@company.co.uk',
        'Phone: 555-123-4567 | Email: contact@site.net',
        'Tel: (555)123-4567, mail: info@domain.com'
    ]
})

print("Original DataFrame:")
print(df)
print("\n" + "="*50 + "\n")

# Task 1: Extract phone numbers in standard format (XXX) XXX-XXXX
# Hint: Handle various formats like (555) 123-4567, 555.123.4567, 555-123-4567, 5551234567
### YOUR CODE HERE ###

# Task 2: Extract email addresses
### YOUR CODE HERE ###

# Task 3: Create a cleaned DataFrame with separate 'phone' and 'email' columns
### YOUR CODE HERE ###

### Exercise 3: Extracting Structured Information from Unstructured Text

In many datasets, important information is embedded in free-form text. You need to extract dates, prices, or other structured data.

**Task**: Extract dates, prices, and product names from customer order descriptions.

In [None]:
# Sample data: Order descriptions with embedded information
df = pd.DataFrame({
    'order_id': [1001, 1002, 1003, 1004, 1005],
    'description': [
        'Order placed on 2024-03-15. Product: Laptop, Price: $999.99',
        'Purchase date: 2024-03-20. Item: Wireless Mouse, Cost: $29.99',
        'Ordered on 2024-04-01. Product: USB-C Cable, Price: $15.50',
        'Date: 2024-04-10. Product: Monitor Stand, Amount: $79.99',
        'Order 2024-04-15. Product: Keyboard, Total: $89.00'
    ]
})

print("Original DataFrame:")
print(df)
print("\n" + "="*50 + "\n")

# Task 1: Extract dates in YYYY-MM-DD format
# Create a new column 'order_date'
### YOUR CODE HERE ###

# Task 2: Extract prices (handle formats like $999.99, $29.99, etc.)
# Create a new column 'price' (as float, without the $ sign)
### YOUR CODE HERE ###

# Task 3: Extract product names (text after "Product:" or "Item:")
# Create a new column 'product_name'
### YOUR CODE HERE ###

# Display the cleaned DataFrame
### YOUR CODE HERE ###

### Exercise 4: Using str.extract() and str.extractall() with Named Groups

Pandas provides convenient methods for applying regex to DataFrame columns. Use `str.extract()` for single matches and `str.extractall()` for multiple matches per row.

**Task**: Extract multiple pieces of information from text using pandas string methods.
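
The difference between the two methods is mostly about the shape of the result. The sketch below uses a tiny illustrative Series (an assumption, not the log data): `str.extract()` keeps one row per input row and fills non-matches with `NaN`, while `str.extractall()` returns one row per match, indexed by the original row label plus a `match` level, and drops rows with no match.

```python
import pandas as pd

# Hypothetical toy data, not the exercise DataFrame
codes = pd.Series(["a1 b2", "c3", "none"])
pattern = r"(?P<letter>[a-z])(?P<digit>\d)"  # named groups become column names

first_match = codes.str.extract(pattern)     # first match only, same length as the input
all_matches = codes.str.extractall(pattern)  # every match, MultiIndex (row, match)

print(first_match)
print(all_matches)
```

Here `first_match` has three rows (the last one all `NaN`), while `all_matches` has three rows coming from only two of the inputs.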

In [None]:
# Sample data: Log entries with IP addresses, timestamps, and status codes
df = pd.DataFrame({
    'log_entry': [
        '192.168.1.1 - [2024-03-15 14:30:00] GET /page1 HTTP/1.1 200',
        '10.0.0.5 - [2024-03-15 15:45:22] POST /api/data HTTP/1.1 404',
        '172.16.0.10 - [2024-03-16 09:12:33] GET /page2 HTTP/1.1 200',
        '192.168.1.2 - [2024-03-16 10:20:15] GET /page3 HTTP/1.1 500',
        '10.0.0.8 - [2024-03-16 11:05:44] POST /api/user HTTP/1.1 201'
    ]
})

print("Original DataFrame:")
print(df)
print("\n" + "="*50 + "\n")

# Task 1: Extract IP address, timestamp, HTTP method, endpoint, and status code
# Use named groups and str.extract() to create separate columns
# Format: IP - [TIMESTAMP] METHOD /endpoint HTTP/version STATUS
### YOUR CODE HERE ###

# Task 2: Filter rows where status code is an error (4xx or 5xx)
### YOUR CODE HERE ###

# Task 3: Extract all IP addresses from a column that may contain multiple IPs per row
# Use str.extractall() if there can be multiple matches
df_multiple = pd.DataFrame({
    'text': [
        'IPs: 192.168.1.1, 10.0.0.5, 172.16.0.1',
        'Contact: 192.168.1.2 or 10.0.0.8',
        'No IPs here',
        'Single IP: 192.168.1.3'
    ]
})
### YOUR CODE HERE ###

### Exercise 5: Data Cleaning with str.replace() and str.contains()

Use regex with pandas string methods to clean data, filter rows, and transform text columns.

**Task**: Clean product descriptions, remove unwanted characters, and filter based on patterns.
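
A minimal sketch of both methods on a toy Series (illustrative data, not the product catalog). In pandas 2.x, `str.replace()` treats the pattern as a literal string unless `regex=True` is passed, whereas `str.contains()` interprets it as a regex by default.

```python
import pandas as pd

# Hypothetical toy data, not the exercise DataFrame
fruits = pd.Series(["Red  apple!!", "green pear", "RED cherry?"])

cleaned = (
    fruits
    .str.replace(r"[!?]+", "", regex=True)   # drop runs of ! and ?
    .str.replace(r"\s+", " ", regex=True)    # collapse repeated whitespace
)

# Boolean mask: whole word "red", ignoring case
is_red = fruits.str.contains(r"\bred\b", case=False, regex=True)

print(cleaned.tolist())         # ['Red apple', 'green pear', 'RED cherry']
print(is_red.tolist())          # [True, False, True]
print(fruits[is_red].tolist())  # rows that mention "red"
```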

In [None]:
# Sample data: Product catalog with messy descriptions
df = pd.DataFrame({
    'product_id': [1, 2, 3, 4, 5, 6],
    'description': [
        'Laptop - $999.99 (In Stock)',
        'Phone @ $599.99 [Available]',
        'Tablet: $399.99 {Limited Stock}',
        'Headphones - $149.99 (Out of Stock)',
        'Keyboard $79.99 [In Stock]',
        'Mouse: $29.99 {Available}'
    ],
    'category': [
        'Electronics - Computers',
        'Electronics/Mobile',
        'Electronics.Tablets',
        'Audio-Equipment',
        'Electronics\\Keyboards',
        'Electronics|Mice'
    ]
})

print("Original DataFrame:")
print(df)
print("\n" + "="*50 + "\n")

# Task 1: Remove special characters from descriptions (keep only alphanumeric, spaces, and basic punctuation)
# Create a 'clean_description' column
### YOUR CODE HERE ###

# Task 2: Extract prices and create a numeric 'price' column
### YOUR CODE HERE ###

# Task 3: Standardize category separators: replace -, /, ., \ (written "\\" in the string literal above), and | with a single separator such as " > "
### YOUR CODE HERE ###

# Task 4: Filter products that are in stock (contain "In Stock" or "Available")
### YOUR CODE HERE ###

# Task 5: Remove bracketed status annotations (e.g., remove [Available], (In Stock), {Limited Stock})
### YOUR CODE HERE ###

## Exercise Set 3: Organizing the Messy Exercise Log

**Task**: Extract each exercise name and its number of sets from the log below.
If an exercise includes a weight, extract the weight and convert all weights to a single unit of measurement (either kg or lbs).

**Log Text**:

```{python}
text = """
Pushups 30 reps 3 sets
5 reps 2 sets Pullups
2 Sets 15 Reps One-leg Squats
4 sets 8 reps 22.5 lbs Dumbbell Rows
4 sets 8 reps 15.25kg Dumbbell Rows
"""

```
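
For the unit-conversion part only, one possible approach is sketched below on different strings than the log. The helper name `to_kg` is hypothetical, the pattern assumes weights are written as a number followed by `kg`, `lb`, or `lbs`, and the constant is the standard pound-to-kilogram conversion factor. Extracting the exercise names and set counts is left to you.

```python
try:
    import regex as re  # the package this notebook imports
except ImportError:
    import re  # standard-library fallback; these calls behave the same

LB_TO_KG = 0.45359237  # one pound in kilograms

def to_kg(line):
    """Return the weight found in `line` in kilograms, or None if there is none.

    Hypothetical helper for illustration; assumes at most one weight per line.
    """
    m = re.search(
        r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>kg|lbs?)\b",
        line,
        flags=re.IGNORECASE,
    )
    if m is None:
        return None
    value = float(m.group("value"))
    return value if m.group("unit").lower() == "kg" else value * LB_TO_KG

print(to_kg("12.5kg Goblet Squats"))      # 12.5
print(to_kg("10 lbs Kettlebell Swings"))  # about 4.54
print(to_kg("20 reps Situps"))            # None
```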

## Key Takeaways

- **Practice is essential** for mastering regex patterns - start with simple character classes and gradually work up to complex patterns.

- **Character classes** `[...]` are fundamental for matching sets of characters, with negation `[^...]` and ranges such as `[a-z]` being powerful tools.

- **Quantifiers** (`*`, `+`, `?`, `{}`) control repetition and are crucial for matching variable-length patterns.

- **Anchors** (`^`, `$`, `\b`) match positions rather than characters, enabling precise pattern matching.

- **Groups** `()` allow you to capture and extract specific parts of matches, which is essential for data extraction tasks.

- **Word boundaries** `\b` are essential for matching complete words without partial matches.

- **Testing and debugging** regex patterns is easier if you first try them in an online tool such as regex101.com (with the Python flavor selected) before implementing them in code.

- Real-world regex applications require combining multiple concepts: character classes, quantifiers, anchors, and groups work together to solve complex text processing problems.
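
As a compact illustration of how anchors, groups, and quantifiers combine, the sketch below pulls items out of a multi-line string. The sample text is an assumption for illustration. By default `^` and `$` match only at the start and end of the whole string; the `re.MULTILINE` flag makes them match at every line boundary.

```python
try:
    import regex as re  # the package this notebook imports
except ImportError:
    import re  # standard-library fallback; these calls behave the same

# Hypothetical sample text
notes = "TODO: fix parser\nnote: TODO later\nTODO: write tests"

# Without the flag, $ cannot match at the end of the first line, so nothing is found
whole_string_only = re.findall(r"^TODO: (.+)$", notes)

# With the flag, each line is anchored separately; the group captures the task text
per_line = re.findall(r"^TODO: (.+)$", notes, flags=re.MULTILINE)

print(whole_string_only)  # []
print(per_line)           # ['fix parser', 'write tests']
```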