# Session 1.8: Regular Expressions

## **Useful for Data Cleaning and Parsing Text**

### **Learning Objectives**
By the end of this session, you will:
- Use Python's re module for pattern matching
- Clean and extract healthcare data from text
- Apply regex to clinical notes and lab results
- Prepare text data for PySpark processing

---

### **Relevance to PySpark**
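Regular expressions are a standard tool for cleaning and parsing raw text before it is loaded into PySpark DataFrames. Inconsistent IDs, units, and dates are much easier to fix at this stage than after they have become malformed column values.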
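
As a rough illustration of this kind of pre-load cleanup, the sketch below uses `re.sub` to normalize inconsistently formatted patient IDs. The sample strings and the `normalize_patient_id` helper are hypothetical, invented for this example only.

```python
import re

def normalize_patient_id(raw: str) -> str:
    """Hypothetical helper: turn variants like 'pt-001' or ' PT 002 ' into 'PT001'."""
    # Drop every character that is not a letter or digit, then upper-case the rest
    return re.sub(r'[^A-Za-z0-9]', '', raw).upper()

# Made-up examples of messy input
raw_ids = ['pt-001', ' PT 002 ', 'Pt_003']
clean_ids = [normalize_patient_id(r) for r in raw_ids]
print(f"Cleaned IDs: {clean_ids}")
```

Once the values are uniform, they can be compared, joined, or grouped reliably in later processing.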

---

## 1. Basic Pattern Matching

```python
import re

# Example: Extract patient IDs from text
text = 'Patient IDs: PT001, PT002, PT003'
ids = re.findall(r'PT\d{3}', text)
print(f"Extracted IDs: {ids}")
```
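
One caveat with a pattern like `PT\d{3}` is that it also matches inside longer tokens. The sketch below, using a made-up string with a four-digit ID, shows how the word-boundary anchor `\b` tightens the match, and how `re.search` differs from `re.findall`.

```python
import re

# Hypothetical input: the last ID has four digits and should be rejected
text = 'Patient IDs: PT001, PT002, PT0034'

# Without boundaries, the pattern also matches the first five characters of 'PT0034'
loose = re.findall(r'PT\d{3}', text)

# \b requires a word boundary on both sides, so 'PT0034' no longer matches
strict = re.findall(r'\bPT\d{3}\b', text)

# re.search returns the first match object (or None) instead of a list of strings
first = re.search(r'\bPT\d{3}\b', text)

print(f"Loose:  {loose}")
print(f"Strict: {strict}")
print(f"First:  {first.group() if first else None}")
```

Whether the loose or strict form is appropriate depends on how the IDs actually appear in your data.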

## 2. Cleaning Healthcare Data with Regex

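```python
import re

# Example: Clean lab result strings
lab_results = [
    'Glucose: 95 mg/dL',
    'Cholesterol: 220 mg/dL',
    'HbA1c: 7.2%'
]

for result in lab_results:
    # Group 1: test name, group 2: integer or decimal value, group 3: optional unit
    match = re.match(r'(\w+):\s*(\d+(?:\.\d+)?)(?:\s*(mg/dL|%))?', result)
    if match:
        # unit is None when the string carries no recognized unit
        test, value, unit = match.groups()
        print(f"Test: {test}, Value: {value}, Unit: {unit}")
```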
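
The extracted values above are still strings. A minimal sketch of one way to turn them into typed records, assuming named groups and a hypothetical `parse_lab_result` helper, might look like this:

```python
import re

# Same idea as the pattern above, with named groups for readability
LAB_PATTERN = re.compile(
    r'(?P<test>\w+):\s*(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>mg/dL|%)?'
)

def parse_lab_result(line):
    """Hypothetical helper: return a dict for a lab string, or None if it does not match."""
    m = LAB_PATTERN.match(line)
    if m is None:
        return None
    # Convert the numeric text to float so it can be used in calculations
    return {'test': m['test'], 'value': float(m['value']), 'unit': m['unit']}

# The last entry is a made-up non-matching line
lab_results = ['Glucose: 95 mg/dL', 'HbA1c: 7.2%', 'not a lab result']
rows = [r for r in map(parse_lab_result, lab_results) if r is not None]
print(rows)
```

Returning `None` for non-matching lines keeps malformed input out of the result instead of raising partway through a batch.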

## 3. Advanced Text Extraction

```python
import re

# Extract dates from clinical notes
notes = [
    'Visit on 2025-07-20: Patient stable.',
    'Follow-up scheduled for 2025-08-01.'
]

for note in notes:
    dates = re.findall(r'\d{4}-\d{2}-\d{2}', note)
    print(f"Dates found: {dates}")
```
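
The pattern `\d{4}-\d{2}-\d{2}` checks only the shape of a date, so it would also accept a string such as `2025-13-45`. One possible safeguard, sketched below with a hypothetical `extract_valid_dates` helper, is to pass each candidate through `datetime.strptime` and keep only those that parse.

```python
import re
from datetime import datetime

def extract_valid_dates(note):
    """Hypothetical helper: return the real calendar dates found in a note."""
    candidates = re.findall(r'\b\d{4}-\d{2}-\d{2}\b', note)
    valid = []
    for candidate in candidates:
        try:
            # strptime raises ValueError for impossible dates such as month 13
            valid.append(datetime.strptime(candidate, '%Y-%m-%d').date())
        except ValueError:
            continue
    return valid

# The second note is a made-up example containing a date-shaped non-date
notes = [
    'Visit on 2025-07-20: Patient stable.',
    'Reference code 2025-13-45 is not a date.'
]
for note in notes:
    print(f"Valid dates: {extract_valid_dates(note)}")
```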

## 4. Practice Exercise

Write a regular expression that extracts the medication name and dosage from each string below.

```python
# Exercise: Extract medication info
med_strings = [
    'Lisinopril 10mg once daily',
    'Metformin 500mg twice daily',
    'Aspirin 81mg as needed'
]

# TODO: Write regex to extract medication name and dosage
# Your code here
```

---

## Summary

In this session, you learned:
- ✅ How to use regular expressions for pattern matching
- ✅ How to clean and extract healthcare data
- ✅ How to apply regex to clinical notes and lab results
- ✅ Essential skills for text data in PySpark

**Next:** Session 1.9 - Classes