# Regular Expressions Tutorial with Python

This Jupyter notebook demonstrates how to use regular expressions in Python for data processing and analysis using the `members.csv` dataset.

## Table of Contents
1. [Introduction](#introduction)
2. [Basic Regex Operations](#basic-regex-operations)
3. [Data Loading and Exploration](#data-loading-and-exploration)
4. [Email Validation](#email-validation)
5. [Date Processing](#date-processing)
6. [Address Analysis](#address-analysis)
7. [Credit Score Analysis](#credit-score-analysis)
8. [Advanced Pattern Matching](#advanced-pattern-matching)
9. [Data Validation](#data-validation)
10. [Summary and Best Practices](#summary-and-best-practices)


## Introduction

Regular expressions (regex) are powerful tools for pattern matching and text processing. In Python, we use the `re` module for regex operations. This tutorial will show you how to apply regex patterns to real data analysis tasks.
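
As a quick orientation, the sketch below contrasts the four `re` functions that appear most often in this kind of work. The sample string is made up for illustration and does not come from `members.csv`.

```python
import re

# Hypothetical sample text, not taken from members.csv
sample = "Order 66 shipped; order 7 pending"

# re.match only succeeds if the pattern matches at the very start of the string
at_start = re.match(r'\d+', sample)        # None: the string starts with "Order"

# re.search scans the whole string and returns the first match
first_number = re.search(r'\d+', sample)   # matches "66"

# re.findall returns every non-overlapping match as a list of strings
all_numbers = re.findall(r'\d+', sample)

# re.sub replaces every match with the given replacement
masked = re.sub(r'\d+', '#', sample)

print(at_start, first_number.group(), all_numbers, masked)
```

The difference between `re.match` (anchored at the start) and `re.search` (anywhere in the string) is the one that most often causes surprising results.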


In [None]:
# Import necessary libraries
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Set up plotting
plt.style.use('default')
sns.set_palette("husl")

print("Libraries imported successfully!")


## Basic Regex Operations

Let's start with some basic regex operations to understand the fundamentals.


In [None]:
# Basic regex patterns
text = "Hello, my email is john.doe@example.com and my phone is (555) 123-4567"

# Email pattern
email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
emails = re.findall(email_pattern, text)
print(f"Found emails: {emails}")

# Phone pattern
phone_pattern = r'\(\d{3}\)\s\d{3}-\d{4}'
phones = re.findall(phone_pattern, text)
print(f"Found phones: {phones}")

# Date pattern
date_text = "Today is 2024-01-15 and tomorrow is 2024-01-16"
date_pattern = r'\d{4}-\d{2}-\d{2}'
dates = re.findall(date_pattern, date_text)
print(f"Found dates: {dates}")
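
When the parts of a match matter, and not just the whole match, named groups can make the result easier to work with. The following is a minimal sketch that reuses the phone format from the cell above; the sample text and the group names (`area`, `prefix`, `line`) are chosen here for illustration.

```python
import re

# Compile once so the pattern can be reused; named groups label each part
phone_re = re.compile(r'\((?P<area>\d{3})\)\s(?P<prefix>\d{3})-(?P<line>\d{4})')

# Hypothetical text with two phone numbers in the same format as above
text = "Office: (555) 123-4567, mobile: (555) 987-6543"

# finditer yields match objects, so groupdict() gives one dict per phone number
phones = [m.groupdict() for m in phone_re.finditer(text)]
for phone in phones:
    print(phone)
```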


## Data Loading and Exploration

Let's load our dataset and explore its structure.


In [None]:
# Load the dataset
df = pd.read_csv('../members.csv')

print("Dataset shape:", df.shape)
print("\nColumn names:")
print(df.columns.tolist())

print("\nFirst few rows:")
df.head()


In [None]:
# Basic information about the dataset
print("Dataset Info:")
df.info()

print("\nBasic Statistics:")
df.describe()


## Email Validation

Let's validate and analyze email addresses using regex patterns.


In [None]:
# Email validation function
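def validate_email(email):
    """Validate email format using regex"""
    # Missing values arrive as NaN (a float); re.match would raise TypeError on them
    if not isinstance(email, str):
        return False
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return bool(re.match(pattern, email))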

# Test email validation
sample_emails = df['Email'].head(10).tolist()
print("Email validation results:")
for email in sample_emails:
    is_valid = validate_email(email)
    print(f"{email}: {'✓ Valid' if is_valid else '✗ Invalid'}")

# Count valid emails
valid_emails = df['Email'].apply(validate_email).sum()
print(f"\nValid emails: {valid_emails}/{len(df)}")
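
Two edge cases are worth keeping in mind for email validation: `$` also matches just before a trailing newline, and missing cells arrive as floats (`NaN`) and not as strings. The sketch below shows one possible stricter variant; the name `validate_email_strict` is hypothetical and the sample address is made up.

```python
import re

EMAIL_RE = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')

def validate_email_strict(value):
    """Hypothetical stricter variant: whole-string match, tolerant of non-strings."""
    if not isinstance(value, str):   # e.g. NaN for a missing cell
        return False
    # fullmatch requires the entire string to match, including any trailing newline
    return EMAIL_RE.fullmatch(value) is not None

with_newline = "jane@example.com\n"

# With ^...$ and re.match, the trailing newline is silently accepted
anchored = bool(re.match(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$', with_newline))
strict = validate_email_strict(with_newline)

print(anchored, strict)
```

Whether a trailing newline should count as invalid depends on the data; stripping whitespace before validating is another reasonable choice.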


In [None]:
# Extract email domains
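def extract_domain(email):
    """Extract domain from email address"""
    # Skip missing or non-string values instead of raising TypeError
    if not isinstance(email, str):
        return None
    pattern = r'@([a-zA-Z0-9.-]+\.[a-zA-Z]{2,})'
    match = re.search(pattern, email)
    return match.group(1) if match else None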

# Extract domains
df['Email_Domain'] = df['Email'].apply(extract_domain)

print("Email domains:")
domain_counts = df['Email_Domain'].value_counts()
print(domain_counts)

# Visualize domain distribution
plt.figure(figsize=(10, 6))
domain_counts.plot(kind='bar')
plt.title('Email Domain Distribution')
plt.xlabel('Domain')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
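
The same extraction can also be written with the pandas string accessor, which applies the pattern to a whole column and leaves missing values as missing. This is a sketch on a small made-up Series, not on the actual dataset.

```python
import pandas as pd

# Hypothetical sample; the third entry is missing and the fourth has no "@"
emails = pd.Series(["ann@example.com", "bob@mail.example.org", None, "not-an-email"])

# expand=False returns a Series when the pattern has exactly one capture group
domains = emails.str.extract(r'@([a-zA-Z0-9.-]+\.[a-zA-Z]{2,})', expand=False)

print(domains.tolist())
```

Rows that are missing or do not match come back as missing values, so no explicit `None` check is needed.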


## Date Processing

Let's process and analyze date information using regex patterns.


In [None]:
# Date validation function
def validate_date(date_str):
    """Validate date format YYYY-MM-DD"""
    pattern = r'^\d{4}-\d{2}-\d{2}$'
    # The regex checks the shape; strptime rejects impossible dates such as 2024-02-30
    if isinstance(date_str, str) and re.match(pattern, date_str):
        try:
            datetime.strptime(date_str, '%Y-%m-%d')
            return True
        except ValueError:
            return False
    return False

# Validate dates
valid_dates = df['DoB'].apply(validate_date).sum()
print(f"Valid dates: {valid_dates}/{len(df)}")

# Extract birth years
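def extract_year(date_str):
    """Extract year from date string"""
    # Missing values are not strings; return None instead of raising TypeError
    if not isinstance(date_str, str):
        return None
    pattern = r'^(\d{4})'
    match = re.match(pattern, date_str)
    return int(match.group(1)) if match else None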

df['Birth_Year'] = df['DoB'].apply(extract_year)
print(f"\nBirth year range: {df['Birth_Year'].min()} - {df['Birth_Year'].max()}")

# Calculate approximate age from the birth year (ignores month and day)
current_year = datetime.now().year
df['Age'] = current_year - df['Birth_Year']
print(f"Age range: {df['Age'].min()} - {df['Age'].max()}")
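
A date regex only checks the shape of the string, not whether the date exists on the calendar. The sketch below separates the two checks so the difference is visible; the helper name `check_date` and the sample values are illustrative.

```python
import re
from datetime import datetime

DATE_RE = re.compile(r'\d{4}-\d{2}-\d{2}')

def check_date(value):
    """Return (format_ok, calendar_ok) for a candidate YYYY-MM-DD string."""
    format_ok = isinstance(value, str) and DATE_RE.fullmatch(value) is not None
    calendar_ok = False
    if format_ok:
        try:
            # strptime raises ValueError for dates that do not exist
            datetime.strptime(value, '%Y-%m-%d')
            calendar_ok = True
        except ValueError:
            pass
    return format_ok, calendar_ok

for candidate in ["1990-05-17", "2024-02-30", "17/05/1990"]:
    print(candidate, check_date(candidate))
```

The second value has the right shape but is not a real date, which is why the regex check alone is not enough.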


In [None]:
# Visualize age distribution
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
df['Age'].hist(bins=20, edgecolor='black')
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')

plt.subplot(1, 2, 2)
df['Birth_Year'].hist(bins=20, edgecolor='black')
plt.title('Birth Year Distribution')
plt.xlabel('Birth Year')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()


## Address Analysis

Let's analyze addresses using regex patterns to extract states, cities, and ZIP codes.


In [None]:
# Extract state from address
def extract_state(address):
    """Extract state abbreviation from address"""
    # Missing values are not strings; return None instead of raising TypeError
    if not isinstance(address, str):
        return None
    pattern = r', ([A-Z]{2}) \d{5}$'
    match = re.search(pattern, address)
    return match.group(1) if match else None

# Extract ZIP code
def extract_zip(address):
    """Extract ZIP code from address"""
    if not isinstance(address, str):
        return None
    pattern = r'(\d{5})$'
    match = re.search(pattern, address)
    return match.group(1) if match else None

# Extract city
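def extract_city(address):
    """Extract city from address"""
    if not isinstance(address, str):
        return None
    # pd.read_csv strips the CSV quoting, so the pattern must not expect literal quotes
    pattern = r'^(.+), ([^,]+), [A-Z]{2} \d{5}$'
    match = re.search(pattern, address)
    return match.group(2) if match else None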

# Apply extraction functions
df['State'] = df['Address'].apply(extract_state)
df['ZIP'] = df['Address'].apply(extract_zip)
df['City'] = df['Address'].apply(extract_city)

print("State distribution:")
state_counts = df['State'].value_counts()
print(state_counts)

print("\nCity distribution (top 10):")
city_counts = df['City'].value_counts().head(10)
print(city_counts)
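
Instead of three separate patterns, a single pattern with named groups can pull out every part of an address in one pass. This sketch assumes the layout `<street>, <city>, <ST> <ZIP>` that the patterns above imply; the helper `parse_address` and the sample address are hypothetical.

```python
import re

# Assumed layout: "<street>, <city>, <ST> <ZIP>"
ADDRESS_RE = re.compile(
    r'^(?P<street>.+), (?P<city>[^,]+), (?P<state>[A-Z]{2}) (?P<zip>\d{5})$'
)

def parse_address(address):
    """Hypothetical helper: return a dict of address parts, or None if no match."""
    if not isinstance(address, str):
        return None
    match = ADDRESS_RE.match(address)
    return match.groupdict() if match else None

parsed = parse_address("123 Main St, Springfield, IL 62704")
print(parsed)
```

Because the city group excludes commas, any extra comma-separated parts (such as an apartment number) end up in the street group.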


In [None]:
# Visualize geographic distribution
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
state_counts.plot(kind='bar')
plt.title('State Distribution')
plt.xlabel('State')
plt.ylabel('Count')
plt.xticks(rotation=45)

plt.subplot(1, 3, 2)
city_counts.plot(kind='bar')
plt.title('Top 10 Cities')
plt.xlabel('City')
plt.ylabel('Count')
plt.xticks(rotation=45)

plt.subplot(1, 3, 3)
df['ZIP'].value_counts().head(10).plot(kind='bar')
plt.title('Top 10 ZIP Codes')
plt.xlabel('ZIP Code')
plt.ylabel('Count')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()


## Credit Score Analysis

Let's analyze credit scores and their patterns.


In [None]:
# Credit score validation
def validate_credit_score(score):
    """Validate credit score range"""
    pattern = r'^\d{3}$'
    if re.match(pattern, str(score)):
        return 300 <= int(score) <= 850
    return False

# Validate credit scores
valid_scores = df['Credit score'].apply(validate_credit_score).sum()
print(f"Valid credit scores: {valid_scores}/{len(df)}")

# Credit score categories
def categorize_credit_score(score):
    """Categorize credit score into ranges"""
    if score >= 800:
        return 'Excellent'
    elif score >= 700:
        return 'Good'
    elif score >= 600:
        return 'Fair'
    else:
        return 'Poor'

df['Credit_Category'] = df['Credit score'].apply(categorize_credit_score)

print("\nCredit score categories:")
category_counts = df['Credit_Category'].value_counts()
print(category_counts)
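
A numeric range such as 300 to 850 can be expressed purely as a regex, but the result is harder to read than a shape check followed by an integer comparison. The sketch below puts both side by side for comparison; the function names are illustrative.

```python
import re

# 300-799 -> [3-7]\d{2}, 800-849 -> 8[0-4]\d, 850 -> literal
SCORE_RE = re.compile(r'(?:[3-7]\d{2}|8[0-4]\d|850)')

def in_range_regex(score):
    """Range check done purely with a regex (harder to read and maintain)."""
    return SCORE_RE.fullmatch(str(score)) is not None

def in_range_numeric(score):
    """Same check: regex for the shape, integer comparison for the range."""
    text = str(score)
    return re.fullmatch(r'\d{3}', text) is not None and 300 <= int(text) <= 850

samples = [299, 300, 799, 800, 849, 850, 851, 1000]
results = [(s, in_range_regex(s), in_range_numeric(s)) for s in samples]
for row in results:
    print(row)
```

Both functions agree on every sample, which suggests keeping the regex for format checks and leaving range logic to ordinary comparisons.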


In [None]:
# Visualize credit score distribution
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
df['Credit score'].hist(bins=20, edgecolor='black')
plt.title('Credit Score Distribution')
plt.xlabel('Credit Score')
plt.ylabel('Frequency')

plt.subplot(1, 3, 2)
category_counts.plot(kind='bar')
plt.title('Credit Score Categories')
plt.xlabel('Category')
plt.ylabel('Count')
plt.xticks(rotation=45)

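plt.subplot(1, 3, 3)
# Pass the current axes; otherwise DataFrame.boxplot with `by` opens a new figure
df.boxplot(column='Credit score', by='Gender', ax=plt.gca())
plt.suptitle('')  # drop the automatic "Boxplot grouped by Gender" title
plt.title('Credit Score by Gender')
plt.xlabel('Gender')
plt.ylabel('Credit Score')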

plt.tight_layout()
plt.show()


## Advanced Pattern Matching

Let's explore more advanced regex patterns and their applications.


In [None]:
# Advanced name pattern matching
def analyze_name_patterns(names):
    """Analyze name patterns using regex"""
    patterns = {
        'First_Last': r'^[A-Z][a-z]+ [A-Z][a-z]+$',
        'With_Middle': r'^[A-Z][a-z]+ [A-Z][a-z]+ [A-Z][a-z]+$',
        'Hyphenated': r'^[A-Z][a-z]+-[A-Z][a-z]+ [A-Z][a-z]+$',
        'Single_Letter': r'^[A-Z]\. ?[A-Z][a-z]+$'  # initial plus last name, e.g. "J. Smith" or "J.Smith"
    }
    
    results = {}
    for pattern_name, pattern in patterns.items():
        matches = names.str.match(pattern, na=False).sum()
        results[pattern_name] = matches
    
    return results

# Analyze name patterns
name_patterns = analyze_name_patterns(df['Full name'])
print("Name pattern analysis:")
for pattern, count in name_patterns.items():
    print(f"{pattern}: {count}")

# Extract first and last names
def extract_first_name(name):
    """Extract first name"""
    pattern = r'^([A-Z][a-z]+)'
    match = re.match(pattern, name)
    return match.group(1) if match else None

def extract_last_name(name):
    """Extract last name"""
    pattern = r'([A-Z][a-z]+)$'
    match = re.search(pattern, name)
    return match.group(1) if match else None

df['First_Name'] = df['Full name'].apply(extract_first_name)
df['Last_Name'] = df['Full name'].apply(extract_last_name)

print("\nMost common first names:")
print(df['First_Name'].value_counts().head(5))

print("\nMost common last names:")
print(df['Last_Name'].value_counts().head(5))
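
The pattern `[A-Z][a-z]+` assumes one capital letter followed only by lowercase letters, so names with internal capitals, apostrophes, or hyphens are cut short. The sketch below compares that pattern with a more tolerant alternative; `last_name_tolerant` is a hypothetical helper and the names are made up.

```python
import re

def last_name_strict(name):
    """Capital letter followed by lowercase letters at the end of the string."""
    match = re.search(r'([A-Z][a-z]+)$', name)
    return match.group(1) if match else None

def last_name_tolerant(name):
    """Hypothetical alternative: take the final whitespace-delimited token."""
    match = re.search(r'(\S+)\s*$', name)
    return match.group(1) if match else None

for full_name in ["John McDonald", "Mary O'Brien", "Ana Smith-Jones"]:
    print(full_name, "->", last_name_strict(full_name), "|", last_name_tolerant(full_name))
```

The tolerant version accepts anything without whitespace, so it trades precision for coverage; which one fits depends on how clean the name column is.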


In [None]:
# Advanced email pattern analysis
def analyze_email_patterns(emails):
    """Analyze email patterns"""
    patterns = {
        'First_Last': r'^[a-zA-Z]+\.[a-zA-Z]+@',
        'First_Initial_Last': r'^[a-zA-Z]\.[a-zA-Z]+@',
        'Numbers': r'[0-9]+@',
        'Underscores': r'[a-zA-Z]+_[a-zA-Z]+@',
        'Hyphens': r'[a-zA-Z]+-[a-zA-Z]+@'
    }
    
    results = {}
    for pattern_name, pattern in patterns.items():
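        # str.match only tests the start of each string; the unanchored patterns
        # above need str.contains, which searches anywhere in the string
        matches = emails.str.contains(pattern, na=False).sum()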
        results[pattern_name] = matches
    
    return results

# Analyze email patterns
email_patterns = analyze_email_patterns(df['Email'])
print("Email pattern analysis:")
for pattern, count in email_patterns.items():
    print(f"{pattern}: {count}")

# Visualize email patterns
plt.figure(figsize=(10, 6))
pattern_names = list(email_patterns.keys())
pattern_counts = list(email_patterns.values())
plt.bar(pattern_names, pattern_counts)
plt.title('Email Pattern Distribution')
plt.xlabel('Pattern Type')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
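
When counting pattern occurrences in a column, the choice between `str.match` and `str.contains` matters: the first only tests the start of each string, the second searches anywhere. A small sketch on made-up addresses shows the difference for an unanchored pattern.

```python
import pandas as pd

# Hypothetical addresses; the first and third have digits right before the "@"
emails = pd.Series(["jane42@example.com", "john.doe@example.com", "42@example.com"])

pattern = r'[0-9]+@'

anchored = emails.str.match(pattern, na=False)      # pattern must match at position 0
anywhere = emails.str.contains(pattern, na=False)   # pattern may match anywhere

print(anchored.tolist())
print(anywhere.tolist())
```

Only the third address starts with digits, so `str.match` misses the first one even though it clearly contains digits before the `@`.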


## Data Validation

Let's create comprehensive data validation using regex patterns.


In [None]:
# Comprehensive data validation
def validate_dataset(df):
    """Validate entire dataset using regex patterns"""
    validation_results = {
        'Email': df['Email'].apply(validate_email).sum(),
        'Date': df['DoB'].apply(validate_date).sum(),
        'Credit_Score': df['Credit score'].apply(validate_credit_score).sum(),
        'Name': df['Full name'].str.match(r'^[A-Z][a-z]+ [A-Z][a-z]+$', na=False).sum(),
        'Gender': df['Gender'].isin(['Male', 'Female']).sum(),
        'Height': df['Height (cm)'].between(100, 250).sum(),
        'Weight': df['Weight (kg)'].between(30, 200).sum()
    }
    
    return validation_results

# Run validation
validation_results = validate_dataset(df)
total_records = len(df)

print("Data Validation Results:")
print("=" * 40)
for field, valid_count in validation_results.items():
    percentage = (valid_count / total_records) * 100
    print(f"{field:15}: {valid_count:3}/{total_records} ({percentage:5.1f}%)")

# Calculate overall data quality score
overall_score = sum(validation_results.values()) / (len(validation_results) * total_records) * 100
print(f"\nOverall Data Quality Score: {overall_score:.1f}%")
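
Per-field counts show how many values fail, but not which rows. One possible way to locate them is to keep the boolean results as a DataFrame and inspect the rows where any check fails. The three-row frame below is hypothetical and only borrows two column names from `members.csv`.

```python
import pandas as pd

# Hypothetical three-row frame using two of the column names from members.csv
sample = pd.DataFrame({
    'Email': ["ann@example.com", "broken-email", "cy@example.org"],
    'DoB': ["1990-05-17", "1985-11-02", "02/11/1985"],
})

# One boolean column per check; na=False treats missing values as invalid
checks = pd.DataFrame({
    'Email': sample['Email'].str.fullmatch(
        r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', na=False),
    'DoB': sample['DoB'].str.fullmatch(r'\d{4}-\d{2}-\d{2}', na=False),
})

# Rows where at least one check failed, with the names of the failing fields
failed = ~checks.all(axis=1)
failing_fields = {
    idx: [col for col in checks.columns if not row[col]]
    for idx, row in checks[failed].iterrows()
}
print(failing_fields)
```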


In [None]:
# Visualize validation results
plt.figure(figsize=(12, 8))

plt.subplot(2, 2, 1)
fields = list(validation_results.keys())
valid_counts = list(validation_results.values())
percentages = [(count / total_records) * 100 for count in valid_counts]

bars = plt.bar(fields, percentages)
plt.title('Data Validation Results')
plt.xlabel('Field')
plt.ylabel('Valid Percentage (%)')
plt.xticks(rotation=45)

# Add percentage labels on bars
for bar, percentage in zip(bars, percentages):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1, 
             f'{percentage:.1f}%', ha='center', va='bottom')

plt.subplot(2, 2, 2)
plt.pie(percentages, labels=fields, autopct='%1.1f%%')
plt.title('Data Quality Distribution')

plt.subplot(2, 2, 3)
plt.hist(df['Credit score'], bins=20, edgecolor='black', alpha=0.7)
plt.title('Credit Score Distribution')
plt.xlabel('Credit Score')
plt.ylabel('Frequency')

plt.subplot(2, 2, 4)
df['Age'].hist(bins=20, edgecolor='black', alpha=0.7)
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()


## Summary and Best Practices

Let's summarize what we've learned and provide some best practices for using regular expressions in data analysis.


In [None]:
# Summary statistics
print("Dataset Summary:")
print("=" * 50)
print(f"Total records: {len(df)}")
print(f"Valid emails: {validation_results['Email']}/{len(df)}")
print(f"Valid dates: {validation_results['Date']}/{len(df)}")
print(f"Valid credit scores: {validation_results['Credit_Score']}/{len(df)}")
print(f"Valid names: {validation_results['Name']}/{len(df)}")
print(f"Valid genders: {validation_results['Gender']}/{len(df)}")
print(f"Valid heights: {validation_results['Height']}/{len(df)}")
print(f"Valid weights: {validation_results['Weight']}/{len(df)}")

print(f"\nOverall data quality: {overall_score:.1f}%")

# Data insights
print("\nData Insights:")
print("=" * 50)
print(f"Average credit score: {df['Credit score'].mean():.1f}")
print(f"Average age: {df['Age'].mean():.1f}")
print(f"Average height: {df['Height (cm)'].mean():.1f} cm")
print(f"Average weight: {df['Weight (kg)'].mean():.1f} kg")
print(f"Gender distribution: {df['Gender'].value_counts().to_dict()}")
print(f"Most common state: {df['State'].mode().iloc[0] if not df['State'].mode().empty else 'N/A'}")
print(f"Most common email domain: {df['Email_Domain'].mode().iloc[0] if not df['Email_Domain'].mode().empty else 'N/A'}")


## Best Practices for Regex in Data Analysis

1. **Always validate your patterns**: Test regex patterns on sample data before applying to large datasets
2. **Use raw strings**: Prefix regex patterns with `r` to avoid escaping issues
3. **Handle edge cases**: Consider empty strings, null values, and unexpected formats
4. **Document your patterns**: Add comments to complex regex patterns
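5. **Test performance**: Some patterns can be slow on large datasets, especially those with nested quantifiers such as `(a+)+`, which can backtrack heavily on non-matching input
6. **Use built-in functions**: Leverage pandas string methods (`str.contains`, `str.match`, `str.extract`) when possible; they handle missing values without extra checks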
7. **Validate results**: Always verify your regex results make sense
8. **Consider alternatives**: Sometimes string methods are more efficient than regex
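
As an illustration of points 1 and 4, the sketch below documents a pattern with `re.VERBOSE` and tries it on a few sample strings before it would be applied to a dataset. The sample addresses are made up.

```python
import re

# re.VERBOSE ignores whitespace in the pattern and allows inline comments
EMAIL_RE = re.compile(r"""
    [a-zA-Z0-9._%+-]+     # local part: letters, digits, and a few symbols
    @                     # separator
    [a-zA-Z0-9.-]+        # domain labels
    \.[a-zA-Z]{2,}        # top-level domain of at least two letters
""", re.VERBOSE)

# Try the pattern on a handful of cases, including an empty string
checks = {
    addr: EMAIL_RE.fullmatch(addr) is not None
    for addr in ["john.doe@example.com", "john.doe@example", "", "a@b.co"]
}
for addr, ok in checks.items():
    print(repr(addr), ok)
```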

## Common Regex Patterns for Data Analysis

- **Email**: `r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'`
- **Phone**: `r'\(\d{3}\)\s\d{3}-\d{4}'`
- **Date**: `r'^\d{4}-\d{2}-\d{2}$'`
- **ZIP Code**: `r'^\d{5}(-\d{4})?$'`
- **Credit Card**: `r'^\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}$'`
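
A small table of inputs and expected results is a cheap way to confirm that patterns like these behave as intended. The sketch below exercises the ZIP code and credit card patterns from the list; the test values are invented, and the credit card pattern only checks the shape of the number, not whether it is a real card.

```python
import re

# Patterns copied from the list above
PATTERNS = {
    'zip': r'^\d{5}(-\d{4})?$',
    'credit_card': r'^\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}$',
}

# Hypothetical test values: (pattern name, input, expected result)
CASES = [
    ('zip', '62704', True),
    ('zip', '62704-1234', True),
    ('zip', '6270', False),
    ('credit_card', '1234 5678 9012 3456', True),
    ('credit_card', '1234-5678-9012-3456', True),
    ('credit_card', '1234567890123456', True),
    ('credit_card', '1234 5678 9012', False),
]

# Each entry records whether the pattern's verdict agrees with the expectation
results = [
    (name, value, bool(re.match(PATTERNS[name], value)) == expected)
    for name, value, expected in CASES
]
for name, value, ok in results:
    print(f"{name:12} {value!r:24} {'ok' if ok else 'MISMATCH'}")
```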

## Conclusion

Regular expressions are powerful tools for data cleaning, validation, and analysis. When used correctly, they can significantly improve your data processing workflows. Remember to always test your patterns and consider the performance implications for large datasets.


In [None]:
# Final demonstration: filtering on the columns derived with regex earlier
print("Final Demonstration: Filtering on Regex-Derived Columns")
print("=" * 60)

# Find people with specific characteristics
def find_high_performers(df):
    """Find people with high credit scores and specific characteristics"""
    criteria = (
        (df['Credit score'] > 800) &
        (df['Age'] >= 25) &
        (df['Age'] <= 40) &
        (df['Height (cm)'] > 170)
    )
    return df[criteria]

high_performers = find_high_performers(df)
print(f"High performers found: {len(high_performers)}")
print("\nHigh performers:")
print(high_performers[['Full name', 'Email', 'Credit score', 'Age', 'Height (cm)']].head())

print("\nTutorial completed successfully!")
print("You've learned how to use regular expressions for data analysis in Python.")
