# Regular Expressions

+ What is Regex?
    + A regular expression (regex) is a pattern that describes a set of strings; it is used to search for certain characters or words inside text
    + With a pattern you can edit or delete the matched text, substitute one thing for another, and extract information from any file or string that contains the pattern

+ Why should we learn and use Regex?
    + Save time when locating information and processing text
    + Practical uses: batch file renaming, parsing logs, validating forms, making mass edits in a codebase, and recursive search


# Python Regular Expression Exercises
We use Python's `re` module, which provides support for regular expressions.

## Basic Concepts

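In Python, a regular expression pattern is usually written as a raw string, i.e. a string literal prefixed with `r`. In a raw string the backslash is kept as-is instead of being treated as the start of a string escape sequence, so sequences such as `\d` or `\b` reach the regex engine unchanged.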
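
To illustrate why the `r` prefix matters, here is a minimal sketch that compares a regular string with a raw string. The sample text and variable names are made up for this illustration.

```python
import re

# In a regular string literal, "\b" is the backspace character (one character).
# In a raw string, r"\b" stays as backslash + "b", which the regex engine
# interprets as a word boundary.
plain = "\bcat\b"
raw = r"\bcat\b"

print(len(plain))  # 5
print(len(raw))    # 7

text = "a cat sat"
print(re.findall(plain, text))  # []
print(re.findall(raw, text))    # ['cat']
```

The pattern without the `r` prefix looks for literal backspace characters around `cat`, so it finds nothing in ordinary text.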

We use the following three functions to apply regular expressions:

`re.match()`
Tries to match the pattern at the beginning of the string only. If the string does not match there, it returns `None`.

`re.search()`
Scans through the string looking for the first location where the pattern produces a match, and returns the corresponding Match object. Returns `None` if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.

`re.findall()`
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found.

See the details at https://docs.python.org/3/library/re.html
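
The difference between the three functions is easiest to see on one string. The following is a small sketch; the sample text is invented for the purpose.

```python
import re

text = "order 66 shipped, order 77 pending"
pattern = r"\d+"  # one or more digits

m = re.match(pattern, text)           # None: the text starts with "o", not a digit
s = re.search(pattern, text)          # Match object for the first hit only
all_hits = re.findall(pattern, text)  # list with every non-overlapping hit

print(m)          # None
print(s.group())  # 66
print(s.span())   # (6, 8)
print(all_hits)   # ['66', '77']
```

Note that `re.match()` and `re.search()` return a Match object (or `None`), while `re.findall()` returns a plain list of strings.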

## Exact Match

In [1]:
# Import the re module
import re

# Test string
test_string = "Hello, my name is LiHua. I'm 15 years old. My phone number is: 010-87654321."

# Define pattern - exact match
pattern1 = r'YOUR REGEX HERE'   # Match the string "Hello"
result1 = re.findall(pattern1, test_string) # Find all the matched strings

print(result1) # Should be ['Hello']

[]


## Match Digit

In [None]:
# Define pattern - match number
pattern2 = r'YOUR REGEX HERE'   # Match any digit
result2 = re.findall(pattern2, test_string) # Find all the matched strings

print(result2) # Should be ['1', '5', '0', '1', '0', '8', '7', '6', '5', '4', '3', '2', '1']

## Match Letter

In [None]:
# Define pattern - match letter
pattern3 = r'YOUR REGEX HERE'   # Match any letter
result3 = re.findall(pattern3, test_string) # Find all the matched strings

print(result3) # Should be ['H', 'e', 'l', 'l', 'o', 'm', 'y', 'n', 'a', 'm', 'e', 'i', 's', 'L', 'i', 'H', 'u', 'a', 'I', 'm', 'y', 'e', 'a', 'r', 's', 'o', 'l', 'd', 'M', 'y', 'p', 'h', 'o', 'n', 'e', 'n', 'u', 'm', 'b', 'e', 'r', 'i', 's']

## Match with Quantifier

In [None]:
pattern5 = r'YOUR REGEX HERE' # One or more digits
pattern6 = r'YOUR REGEX HERE' # Exactly 3 digits

result5 = re.findall(pattern5, test_string)
result6 = re.findall(pattern6, test_string)


print(result5) # Should be ['15', '010', '87654321']
print(result6) # Should be ['010', '876', '543']

# Bonus Exercises (difficult)

## Introduction to Group with ()

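**Capture groups** are sub-expressions enclosed in parentheses. Each group records the part of the text it matched, so that part can be retrieved separately from the whole match.

The log parsing exercise below puts them to use.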
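
As a quick illustration of capture groups, the sketch below pulls two fields out of a made-up string with `re.search()` and reads them back from the Match object.

```python
import re

entry = "user=alice id=42"
m = re.search(r"user=(\w+) id=(\d+)", entry)

print(m.group(0))  # user=alice id=42   (the whole match)
print(m.group(1))  # alice              (first group)
print(m.group(2))  # 42                 (second group)
print(m.groups())  # ('alice', '42')
```

Groups are numbered from 1 in the order of their opening parentheses; group 0 is always the entire match.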

## Exercise: Log Parsing
Regex can be used for parsing logs, especially when you only need specific pieces of information. For example, training a machine learning model produces training logs.

Now we want to extract the training loss (`loss`) and the validation loss (`val_loss`) separately, using regular expressions and the `re.findall()` function.

`re.findall()` has a special behavior:
+ If the pattern has NO capturing groups (no parentheses), it returns the entire matched strings
+ If the pattern has exactly one capturing group, it returns ONLY what's inside the group
+ If the pattern has several capturing groups, it returns a list of tuples, one tuple per match

See https://docs.python.org/3/library/re.html#re.findall for reference.
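
The effect of groups on `re.findall()` can be sketched on a short, invented string of sensor readings (not the training log used in the exercise):

```python
import re

readings = "temp: 21.5, humidity: 40.0, temp: 22.1"

# No group: whole matches are returned
no_group = re.findall(r"temp: \d+\.\d+", readings)
# One group: only the group's content is returned
one_group = re.findall(r"temp: (\d+\.\d+)", readings)
# Two groups: a list of tuples is returned
two_groups = re.findall(r"(\w+): (\d+\.\d+)", readings)

print(no_group)    # ['temp: 21.5', 'temp: 22.1']
print(one_group)   # ['21.5', '22.1']
print(two_groups)  # [('temp', '21.5'), ('humidity', '40.0'), ('temp', '22.1')]
```

The literal text outside the parentheses still has to match; it just is not part of what `re.findall()` returns.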

In [None]:
import re

training_log = """
Epoch 1/10
500/500 [==============================] - 5s 10ms/step - loss: 0.6931 - accuracy: 0.5000 - val_loss: 0.6928 - val_accuracy: 0.5000
Epoch 2/10
500/500 [==============================] - 4s 8ms/step - loss: 0.6925 - accuracy: 0.5100 - val_loss: 0.6923 - val_accuracy: 0.5100
Epoch 3/10
500/500 [==============================] - 4s 8ms/step - loss: 0.6918 - accuracy: 0.5200 - val_loss: 0.6917 - val_accuracy: 0.5200
Epoch 4/10
500/500 [==============================] - 4s 8ms/step - loss: 0.6912 - accuracy: 0.5300 - val_loss: 0.6911 - val_accuracy: 0.5300
Epoch 5/10
500/500 [==============================] - 4s 8ms/step - loss: 0.6905 - accuracy: 0.5400 - val_loss: 0.6905 - val_accuracy: 0.5400
Epoch 6/10
500/500 [==============================] - 4s 8ms/step - loss: 0.6899 - accuracy: 0.5500 - val_loss: 0.6899 - val_accuracy: 0.5500
Epoch 7/10
500/500 [==============================] - 4s 8ms/step - loss: 0.6892 - accuracy: 0.5600 - val_loss: 0.6893 - val_accuracy: 0.5600
Epoch 8/10
500/500 [==============================] - 4s 8ms/step - loss: 0.6886 - accuracy: 0.5700 - val_loss: 0.6887 - val_accuracy: 0.5700
Epoch 9/10
500/500 [==============================] - 4s 8ms/step - loss: 0.6879 - accuracy: 0.5800 - val_loss: 0.6881 - val_accuracy: 0.5800
Epoch 10/10
500/500 [==============================] - 4s 8ms/step - loss: 0.6873 - accuracy: 0.5900 - val_loss: 0.6875 - val_accuracy: 0.5900
"""

# Extract loss and val_loss using regex
# TODO: try with grouping ()
loss_pattern = r'YOUR REGEX HERE'
# TODO: try with grouping ()
val_loss_pattern = r'YOUR REGEX HERE'

# Find all matches and convert them to float
loss_values = [float(x) for x in re.findall(loss_pattern, training_log)]
val_loss_values = [float(x) for x in re.findall(val_loss_pattern, training_log)]

print("Loss values:", loss_values)
print("Validation loss values:", val_loss_values)

## Exercise: Validate Email Address

Regex can be used for validating the format of a string, e.g. emails, phone numbers, and passwords.

Please write a regex that matches the following two email formats:
+ someone@gmail.com
+ bill.gates@microsoft.com

Hints:
+ The part of the address before @ must consist of letters, numbers, or dots
+ Use `[a-zA-Z0-9.]` to match letters, numbers, and dots
+ Use `\w` to match letters, numbers, and the underscore
+ Use `+` to match one or more repetitions of the preceding element
+ Remember to escape the dot in the domain extension with a backslash: `\.` matches a literal dot (e.g. `\.com`)
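
Two of these hints are easy to get wrong, so here is a small sketch on made-up file names (not email addresses). It shows the difference between `.` and `\.`, and between `re.match()`, which only anchors the start of the string, and `re.fullmatch()`, which requires the whole string to match.

```python
import re

# An unescaped dot matches any character; an escaped dot matches only "."
any_char = bool(re.fullmatch(r"\w+.txt", "notes_txt"))
literal_dot = bool(re.fullmatch(r"\w+\.txt", "notes_txt"))
print(any_char)     # True
print(literal_dot)  # False

# re.match() accepts trailing text after the match; re.fullmatch() does not
prefix_ok = bool(re.match(r"\w+\.txt", "notes.txt.bak"))
whole_ok = bool(re.fullmatch(r"\w+\.txt", "notes.txt.bak"))
print(prefix_ok)    # True
print(whole_ok)     # False
```

For validation tasks, `re.fullmatch()` (or a pattern ending in `$`) is usually the safer choice, because `re.match()` alone would accept a valid prefix followed by arbitrary text.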

In [None]:
import re

def is_valid_email(addr):
    # TODO
    pattern = r'YOUR REGEX HERE'
    return bool(re.match(pattern, addr))

# Test
assert is_valid_email('someone@gmail.com')
assert is_valid_email('bill.gates@microsoft.com')
assert not is_valid_email('bob#example.com')
assert not is_valid_email('mr-bob@example.com')
print('OK')

## Exercise: File Renaming

We can use regex to rename files by replacing part of the original name with a new pattern. The function `re.sub(pattern, repl, string)` replaces every occurrence of `pattern` in `string` with `repl`; `repl` can be a string (which may contain backreferences to captured groups) or a function.

See https://docs.python.org/3/library/re.html#re.sub for reference.
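
Both forms of `repl` can be sketched on invented sample text. The helper name `shout` is hypothetical and chosen only for this illustration.

```python
import re

names = "Lovelace, Ada; Turing, Alan"

# String replacement: \1 and \2 refer to the first and second captured group
swapped = re.sub(r"(\w+), (\w+)", r"\2 \1", names)
print(swapped)  # Ada Lovelace; Alan Turing

# Function replacement: called once per match with the Match object,
# and must return the replacement text
def shout(match):
    return match.group(0).upper()

shouted = re.sub(r"\b\w{4}\b", shout, "keep calm and code")
print(shouted)  # KEEP CALM and CODE
```

Writing the replacement as a raw string keeps `\1` and `\2` from being interpreted as string escape sequences.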

You need to write a function that renames files according to these rules:
1. Extract the date from filenames in format YYYYMMDD
2. Convert it to YYYY-MM-DD format
3. Keep the rest of the filename unchanged

Examples of files to rename:
- photo_20240315_paris.jpg → photo_2024-03-15_paris.jpg
- doc_20231225_final.pdf → doc_2023-12-25_final.pdf
- note_20220930.txt → note_2022-09-30.txt

Write a function that takes a filename and returns the new filename.

Hints:
1. Look for 8 digits in sequence (YYYYMMDD pattern)
2. Use capturing groups to keep parts before and after the date
3. Refer to the captured groups in the replacement with `\1` (first group), `\2` (second group), and so on.
4. Think about how to insert the dashes in the right positions

In [None]:
import re

def rename_file(filename):
    # TODO: Write your regex pattern and replacement
    pattern = r'YOUR REGEX HERE'
    replacement = r'YOUR REPLACEMENT HERE'

    # Replace the filename
    filename = re.sub(pattern, replacement, filename)
    
    return filename

# Tests
assert rename_file("photo_20240315_paris.jpg") == "photo_2024-03-15_paris.jpg"
assert rename_file("doc_20231225_final.pdf") == "doc_2023-12-25_final.pdf"
assert rename_file("note_20220930.txt") == "note_2022-09-30.txt"
assert rename_file("test_20241231.docx") == "test_2024-12-31.docx"
print("All tests passed!")

## Solution: Validate Email Address

In [None]:
import re

def is_valid_email(addr):
    # Pattern: letters/numbers/dots @ word . letters
    pattern = r'[a-zA-Z0-9.]+@\w+\.[a-zA-Z]+'
    return bool(re.match(pattern, addr))

# Test cases
assert is_valid_email('someone@gmail.com')          # ✓ Valid
assert is_valid_email('bill.gates@microsoft.com')   # ✓ Valid
assert not is_valid_email('bob#example.com')        # ✗ Invalid (# not allowed)
assert not is_valid_email('mr-bob@example.com')     # ✗ Invalid (- not allowed)
