## What is a Regular Expression?

* A regular expression (regex) is a sequence of characters that defines a search pattern, used to match or find strings or sets of characters.
* Think of it like a search engine that can identify patterns in a body of text, not just fixed strings.

For example:

* \\d: Matches any digit (0-9).
* \\w: Matches any word character (alphanumeric + underscore).
* \\s: Matches any whitespace character.
* .: Matches any character except a newline. To make "." match \\n as well, use the re.DOTALL flag.
* \*: Matches 0 or more repetitions of the preceding element.
* +: Matches 1 or more repetitions of the preceding element.
* ^: Matches the start of a string.
* $: Matches the end of a string.
* {x}: Matches exactly x occurrences of the preceding element.
* [xyz]: Matches any one of the characters x, y, or z.
* |: Means "OR": matches the pattern on either side.
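The symbols listed above can be tried out on a short string. The following is a minimal sketch; the `sample` string and the variable names are made up for illustration and are not part of the activity data.

```python
import re

# Hypothetical sample string containing a newline
sample = "Order 42 shipped\non 2024-03-15"

digits = re.findall(r'\d+', sample)         # one or more digits in a row
first_word = re.findall(r'^\w+', sample)    # word characters at the start of the string
years = re.findall(r'\d{4}', sample)        # exactly four digits
either = re.findall(r'shipped|on', sample)  # "shipped" OR "on"

# "." does not match the newline between "shipped" and "on" unless re.DOTALL is set
without_flag = re.search(r'shipped.on', sample)
with_flag = re.search(r'shipped.on', sample, re.DOTALL)

print(digits)        # ['42', '2024', '03', '15']
print(first_word)    # ['Order']
print(years)         # ['2024']
print(either)        # ['shipped', 'on']
print(without_flag)  # None
print(with_flag is not None)  # True
```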


# Python's "re" module helps to handle regular expressions

In [None]:
# Common functions in re: search(), match(), findall(), sub()

import re

# Example text containing dates in different formats
text = """
The first meeting was held on 12-05-2023.
The second meeting was scheduled for 03/15/2024.
A third one took place on 2025-01-10.
"""

# 1. Simple Date Pattern Matching
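# Match dates written as two digits, two digits, four digits (e.g. "DD-MM-YYYY" or "MM/DD/YYYY"),
# or as "YYYY-MM-DD"
pattern = r'\d{2}[-/]\d{2}[-/]\d{4}|\d{4}-\d{2}-\d{2}'
# \d{2} matches exactly two digits (0-9)
# [-/] matches either one of the characters inside the brackets: a hyphen or a slash
# | (OR) separates the two alternative date layouts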


# Find all matching dates
dates = re.findall(pattern, text)
print("Found dates:", dates)

Found dates: ['12-05-2023', '03/15/2024', '2025-01-10']


Additional re Functions:


Searching


In [None]:
match = re.search(r'\d{4}-\d{2}-\d{2}', text)
if match:
    print("First date found:", match.group())

# search() scans the text and returns only the first match of the pattern


First date found: 2025-01-10
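match() is listed among the common functions but differs from search() in one respect: it only tries to match at the beginning of the string. A small sketch of the difference, using a made-up `line` string:

```python
import re

date_pattern = r'\d{4}-\d{2}-\d{2}'
line = "Meeting on 2025-01-10"  # hypothetical example string

at_start = re.match(date_pattern, line)     # None: the line does not begin with a date
anywhere = re.search(date_pattern, line)    # finds the date wherever it occurs
starts_with_date = re.match(date_pattern, "2025-01-10 meeting")

print(at_start)                  # None
print(anywhere.group())          # 2025-01-10
print(starts_with_date.group())  # 2025-01-10
```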


Substitution

In [None]:
new_text = re.sub(r'\d{2}-\d{2}-\d{4}', 'DD-MM-YYYY', text)
print("Updated text:", new_text)

# sub() replaces every match; only the first date in this text has the two-two-four digit layout with hyphens


Updated text: 
The first meeting was held on DD-MM-YYYY.
The second meeting was scheduled for 03/15/2024.
A third one took place on 2025-01-10.
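By default sub() replaces every match. The sketch below, on a made-up sentence, shows the optional `count` argument for limiting the number of replacements, and numbered group references in the replacement string for reordering the parts of a date.

```python
import re

# Hypothetical sentence with two dates in the same layout
sentence = "Met on 12-05-2023 and again on 20-06-2023."
pattern = r'(\d{2})-(\d{2})-(\d{4})'  # three capture groups: day, month, year

all_replaced = re.sub(pattern, 'DATE', sentence)          # every match is replaced
first_only = re.sub(pattern, 'DATE', sentence, count=1)   # only the first match
reordered = re.sub(pattern, r'\3-\2-\1', sentence)        # rewrite as year-month-day

print(all_replaced)  # Met on DATE and again on DATE.
print(first_only)    # Met on DATE and again on 20-06-2023.
print(reordered)     # Met on 2023-05-12 and again on 2023-06-20.
```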



# re.compile()

* The re.compile() function compiles a regular expression pattern into a regular expression object, whose methods (search(), findall(), sub(), and so on) can then be used for matching operations.
* The main advantage of re.compile() is reuse: a pattern that is needed multiple times is defined and compiled once. The re module also caches recently used patterns, so the speed gain is usually modest when only a few patterns are in play.


Python re library documentation: https://docs.python.org/3/library/re.html

In [None]:
# Compile the pattern for efficiency
pattern = re.compile(r'\d{2}[-/]\d{2}[-/]\d{4}|\d{4}-\d{2}-\d{2}')

# Use the compiled pattern to find all dates
dates = pattern.findall(text)
print("Found dates:", dates)

#search() – Search for the first occurrence that matches the pattern.
match = pattern.search(text)
if match:
    print("First date found:", match.group())

#sub() – Replace text based on the matching pattern.
new_text = pattern.sub('DATE', text)
print("Updated text:", new_text)


Found dates: ['12-05-2023', '03/15/2024', '2025-01-10']
First date found: 12-05-2023
Updated text: 
The first meeting was held on DATE.
The second meeting was scheduled for DATE.
A third one took place on DATE.



Hands-on activity:
input_data.txt is the data you will be working with.
Given information:
* The data is stored in lines and grouped into 3 categories: "Customer Details:", "Order Information:", and "Transactions History:". You can assume that the rows of each category are stored contiguously in the file and that categories are separated by empty lines.
* A customer row always starts with the customer name.
* Order ID is always in the format "XXX-XXX-XXX", where X is a digit from 0 to 9. The order date appears first, followed by the delivery date.
* Transaction ID starts with one or two capital letters followed by a "-" and 5 digits.
* Identify and use any other patterns you observe in the data, where applicable.
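The ID formats described above translate fairly directly into patterns. The following is one possible sketch, not the only valid answer; the pattern names are illustrative, and the two sample lines are copied from the input data.

```python
import re

# \b (word boundary) keeps the pattern from matching inside longer digit runs
order_id_pattern = re.compile(r'\b\d{3}-\d{3}-\d{3}\b')        # XXX-XXX-XXX
transaction_id_pattern = re.compile(r'\b[A-Z]{1,2}-\d{5}\b')   # 1-2 capitals, "-", 5 digits

order_line = ("Order ID: 556-347-900, Items Ordered: 1 Smartphone, 1 Charger, "
              "Total Paid: $749.99, Date: 2023/02/16, Shipping Method: Expedited, "
              "Delivery Date: 2023-02-19")
transaction_line = "Transaction: P-56789, $749.99, PayPal, Date: 16-02-2023"

order_id = order_id_pattern.search(order_line).group()
transaction_id = transaction_id_pattern.search(transaction_line).group()

print(order_id)        # 556-347-900
print(transaction_id)  # P-56789
```

Without the trailing word boundary, the order ID pattern would also match the first nine digits of a phone number such as 123-555-7890.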


In-Class Tasks:
- Extract customer names, their corresponding emails, and their date of birth, all as strings from the document.

Create appropriate structures to extract, store, and print the following:
- For each customer: {customer_name:\[date_of_birth,email_id,phone_number]}



In [2]:
import re

# Regex patterns for name, email, and DOB
# Name: first run of letters, whitespace, commas, and periods; + repeats the character class one or more times
name_pattern = re.compile(r'[A-Za-z\s,.]+')
email_pattern = re.compile(r'[\w\.-]+@[\w\.-]+')  # Matches emails
# Matches the DOB layouts YYYY-MM-DD, MM/DD/YYYY, and "12th November 1992"
dob_pattern = re.compile(r'\d{4}-\d{2}-\d{2}|\d{2}/\d{2}/\d{4}|\d{1,2}(st|nd|rd|th)?\s\w+\s\d{4}')

customer_details = []

with open('re_input_data.txt', 'r') as file:
    for line in file:
        line = line.strip()  # Remove leading and trailing whitespace, including the newline

        name_match = name_pattern.search(line)
        email_match = email_pattern.search(line)
        dob_match = dob_pattern.search(line)

        # Only customer rows contain all three fields
        if name_match and email_match and dob_match:
            customer_details.append({
                "Name": name_match.group(),
                "Email": email_match.group(),
                "DOB": dob_match.group()
            })


for customer in customer_details:
    print(f"Name: {customer['Name']}, Email: {customer['Email']}, DOB: {customer['DOB']}")




Name: James McCoy , Email: james.mccoy@some-email.com, DOB: 1983-09-12
Name: Emma Stone , Email: emma.r.stone@website.com, DOB: 12th November 1992
Name: Michael Harper, Jr. , Email: michael.harper@domain.org, DOB: 06/15/1985
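The in-class task also asks for a {customer_name: [date_of_birth, email_id, phone_number]} structure. Here is a rough sketch of one way to build it. To stay self-contained it works on the three customer rows copied from the input data instead of reading the file. The `phone_pattern` relies on the "Contact:" or "Phone:" label, which is an assumption drawn from these three rows only.

```python
import re

# Customer rows copied from the "Customer Details:" block of the input data
lines = [
    "James McCoy - Email: james.mccoy@some-email.com - Contact: (555) 123-4567 - Address: 745 Pineapple St, San Francisco, CA 94109 - DOB: 1983-09-12",
    "Emma Stone - Email: emma.r.stone@website.com - Phone: 123-555-7890 ext. 1234 - Address: 56 Lavender Rd, Apt 2B, NY, NY 10001 - DOB: 12th November 1992",
    "Michael Harper, Jr. - Contact: +1-555-9876543 - Address: 3000 Ocean Dr #506, Miami Beach, FL - DOB: 06/15/1985 - Email: michael.harper@domain.org",
]

name_pattern = re.compile(r'^[A-Za-z\s,.]+')  # anchored at the start of the row
email_pattern = re.compile(r'[\w.-]+@[\w.-]+')
dob_pattern = re.compile(r'\d{4}-\d{2}-\d{2}|\d{2}/\d{2}/\d{4}|\d{1,2}(?:st|nd|rd|th)?\s\w+\s\d{4}')
# Assumption: the phone number follows a "Contact:" or "Phone:" label and ends before " - "
phone_pattern = re.compile(r'(?:Contact|Phone):\s*(.+?)\s+-\s')

customers = {}
for line in lines:
    name = name_pattern.search(line)
    email = email_pattern.search(line)
    dob = dob_pattern.search(line)
    phone = phone_pattern.search(line)
    if name and email and dob and phone:
        # strip() drops the space that the name pattern picks up before the first " - "
        customers[name.group().strip()] = [dob.group(), email.group(), phone.group(1)]

for customer_name, details in customers.items():
    print(customer_name, details)
```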



Practice Tasks (outside of class):
- Extract customer names and their corresponding emails from the document (automatically checking if it is a valid email).
- Extract all order totals and ensure they are in numerical format.
- Normalize all dates in the document to a consistent format (e.g., YYYY-MM-DD).
- Extract phone numbers in different formats, including those with extensions and international formats.
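For the date normalization task, one possible approach is to strip ordinal suffixes with a regex and then try a list of known layouts with datetime.strptime. The sketch below makes two assumptions: slash dates with the year last are MM/DD/YYYY and hyphen dates with the year last are DD-MM-YYYY (as the sample data suggests), and month names are in English, since strptime's %B depends on the locale. `DATE_FORMATS` and `normalize_date` are illustrative names.

```python
import re
from datetime import datetime

# Assumed layouts, tried in order; ambiguous inputs take the first layout that parses
DATE_FORMATS = [
    '%Y-%m-%d', '%Y/%m/%d', '%Y.%m.%d',  # year first
    '%m/%d/%Y',                          # assumed MM/DD/YYYY
    '%d-%m-%Y',                          # assumed DD-MM-YYYY
    '%d %B %Y', '%B %d, %Y',             # month written out
]

def normalize_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known layout fits."""
    # Drop ordinal suffixes such as "12th" -> "12"
    cleaned = re.sub(r'(\d)(?:st|nd|rd|th)\b', r'\1', raw.strip())
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).strftime('%Y-%m-%d')
        except ValueError:
            continue
    return None

print(normalize_date("12th November 1992"))  # 1992-11-12
print(normalize_date("06/15/1985"))          # 1985-06-15
print(normalize_date("February 19, 2023"))   # 2023-02-19
```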


Create appropriate structures to extract, store, and print the following:
- For each customer: {customer_name:\[date_of_birth,email_id (if not valid store "-"),phone_number]}
- For each order: {order_id: \[amount, ordered_date, delivery_date]}
- For each Transaction: {transaction_id: \[amount, date]}
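As a starting point, the order structure could be assembled roughly as follows. The sketch works on the three order rows copied from the input data and assumes that the first date on a row is the order date and the last is the delivery date, as stated in the given information. The pattern names are illustrative, and the dates are left in their original layouts.

```python
import re

# Order rows copied from the "Order Information:" block of the input data
order_lines = [
    "Order Number: 455-213-789, Items: 3x Laptops, 2x Keyboards, 1x Headset, Total: $1150.75, Date: 2023-02-14, Shipping: Free Shipping, Arrival: February 19, 2023",
    "Order ID: 556-347-900, Items Ordered: 1 Smartphone, 1 Charger, Total Paid: $749.99, Date: 2023/02/16, Shipping Method: Expedited, Delivery Date: 2023-02-19",
    "Order ID: 989-432-110, 2 Tablets, 3 Speakers, 1 Keyboard, Price: $2100.10, Ordered on: 2023.02.17, Expected Arrival: 2023-02-21",
]

order_id_pattern = re.compile(r'\b\d{3}-\d{3}-\d{3}\b')
amount_pattern = re.compile(r'\$(\d+(?:\.\d+)?)')  # digits after "$", optional decimals
# Year-first dates with -, / or . separators, or "Month D, YYYY"
date_pattern = re.compile(r'\d{4}[-/.]\d{2}[-/.]\d{2}|[A-Z][a-z]+ \d{1,2}, \d{4}')

orders = {}
for line in order_lines:
    order_id = order_id_pattern.search(line)
    amount = amount_pattern.search(line)
    dates = date_pattern.findall(line)
    if order_id and amount and len(dates) >= 2:
        # float() converts the captured amount to a number
        orders[order_id.group()] = [float(amount.group(1)), dates[0], dates[-1]]

for order_id, details in orders.items():
    print(order_id, details)
```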

In [None]:
with open('re_input_data.txt', 'r') as file:
  for line in file:
    line = line.strip()  # Remove leading and trailing whitespace from each line


In [None]:
Customer Details:
James McCoy - Email: james.mccoy@some-email.com - Contact: (555) 123-4567 - Address: 745 Pineapple St, San Francisco, CA 94109 - DOB: 1983-09-12
Emma Stone - Email: emma.r.stone@website.com - Phone: 123-555-7890 ext. 1234 - Address: 56 Lavender Rd, Apt 2B, NY, NY 10001 - DOB: 12th November 1992
Michael Harper, Jr. - Contact: +1-555-9876543 - Address: 3000 Ocean Dr #506, Miami Beach, FL - DOB: 06/15/1985 - Email: michael.harper@domain.org

Order Information:
Order Number: 455-213-789, Items: 3x Laptops, 2x Keyboards, 1x Headset, Total: $1150.75, Date: 2023-02-14, Shipping: Free Shipping, Arrival: February 19, 2023
Order ID: 556-347-900, Items Ordered: 1 Smartphone, 1 Charger, Total Paid: $749.99, Date: 2023/02/16, Shipping Method: Expedited, Delivery Date: 2023-02-19
Order ID: 989-432-110, 2 Tablets, 3 Speakers, 1 Keyboard, Price: $2100.10, Ordered on: 2023.02.17, Expected Arrival: 2023-02-21

Transactions History:
Transaction ID: TX-12345, Amount: $1150.75, Paid by: Credit Card, Date: 2023-02-15
Transaction: P-56789, $749.99, PayPal, Date: 16-02-2023
Payment ID: P-99999, Amount Paid: $2100.10, Via: Debit Card, Date: 2023-02-18
