## Regex

Regular Expressions (regex) are sequences of characters that define a search pattern. 
They are commonly used for pattern matching within strings, such as finding words, validating input, or replacing text.

In [2]:
# This is how we import the module
import re

### 1. Basic Regex Syntax

Let's start with a simple example: matching the word "apple" in a string, much like a keyword search.

In [3]:
text = "I like to eat apple and banana."
pattern = r"apple"

In [8]:
match = re.search(pattern, text)
if match:
    print("Found:", match.group())

Found: apple


In [9]:
# Where is the match in the text?
print("Match found at index:", match.start())
print("Match ends at index:", match.end())
print("Match span:", match.span())

Match found at index: 14
Match ends at index: 19
Match span: (14, 19)


### 2. Find All Occurrences

You can use `re.findall()` to find all occurrences of a pattern in a string.

In [10]:
# Find all words in the text (runs of word characters between word boundaries)
matches = re.findall(r"\b\w+\b", text)
print("Words in text:", matches)


Words in text: ['I', 'like', 'to', 'eat', 'apple', 'and', 'banana']
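
When the position of each match matters as well as its text, `re.finditer()` can be used instead of `re.findall()`. The following is a small sketch that reuses the same sentence; the variable name `word_positions` is only illustrative.

```python
import re

text = "I like to eat apple and banana."

# finditer yields match objects, so each word comes with its span
word_positions = [(m.group(), m.span()) for m in re.finditer(r"\b\w+\b", text)]

for word, span in word_positions:
    print(word, span)
# The fifth entry is ('apple', (14, 19)), matching the span found earlier
```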


### 3. Matching Digits

We can also use regex to match digits in a string. The pattern `\d` matches any digit.


In [11]:
text_with_numbers = "My phone number is 123-456-7890."
digits = re.findall(r"\d", text_with_numbers)
print("Digits in the text:", digits)


Digits in the text: ['1', '2', '3', '4', '5', '6', '7', '8', '9', '0']
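
Note that `\d` matches one digit at a time, which is why every digit comes back as a separate item. As a quick sketch, adding the `+` quantifier groups consecutive digits into runs (the name `digit_runs` is just for illustration):

```python
import re

text_with_numbers = "My phone number is 123-456-7890."

# \d+ matches one or more consecutive digits, so each run is a single item
digit_runs = re.findall(r"\d+", text_with_numbers)
print("Digit runs:", digit_runs)  # Digit runs: ['123', '456', '7890']
```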


### 4. Using Groups in Regex

Groups allow you to capture parts of the match. You can use parentheses to define groups in the pattern.

In [12]:
# Code to capture area code from a phone number
phone_number = "My phone number is 123-456-7890."
match = re.search(r"(\d{3})-(\d{3})-(\d{4})", phone_number)

if match:
    print("Area code:", match.group(1))  # The first group (area code)
    print("Full number:", match.group())  # The whole matched string

Area code: 123
Full number: 123-456-7890
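
Groups can also be given names with `(?P<name>...)`, which tends to be easier to read than numeric indexes. Below is a minimal sketch; the group names `area`, `prefix`, and `line` are assumptions chosen for this example.

```python
import re

phone_number = "My phone number is 123-456-7890."

# Same pattern as above, but each group has an (illustrative) name
named = re.search(r"(?P<area>\d{3})-(?P<prefix>\d{3})-(?P<line>\d{4})", phone_number)

if named:
    print(named.groups())       # ('123', '456', '7890')
    print(named.group("area"))  # 123
    print(named.groupdict())    # {'area': '123', 'prefix': '456', 'line': '7890'}
```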


### 5. Replacing Text

We can use `re.sub()` to replace parts of the string that match a pattern.

In [13]:
replaced_text = re.sub(r"\d", "X", text_with_numbers)
print("Replaced text:", replaced_text)

Replaced text: My phone number is XXX-XXX-XXXX.
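
`re.sub()` can also reuse captured groups in the replacement and limit how many replacements it makes. The sketch below shows both on the same string; the masking format is only an example.

```python
import re

text_with_numbers = "My phone number is 123-456-7890."

# \1 in the replacement refers to the first captured group (the last four digits)
masked = re.sub(r"\d{3}-\d{3}-(\d{4})", r"XXX-XXX-\1", text_with_numbers)
print(masked)  # My phone number is XXX-XXX-7890.

# count=3 stops after the first three replacements
first_three = re.sub(r"\d", "X", text_with_numbers, count=3)
print(first_three)  # My phone number is XXX-456-7890.
```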


### 6. Validating an Email Address

A common use case of regex is validating user input. Let's try to validate an email address using a regex pattern.

In [14]:
# Code to validate email
email = "test@example.com"
email_pattern = r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$"

if re.match(email_pattern, email):
    print(f"{email} is a valid email.")
else:
    print(f"{email} is not a valid email.")


test@example.com is a valid email.


Can you write a valid email address that this regex pattern fails to recognize?

In [None]:
# YOUR CODE HERE

email_hard = ...
email_pattern = r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$"

if re.match(email_pattern, email_hard):
    print(f"{email_hard} is a valid email.")
else:
    print(f"{email_hard} is not a valid email.")


# Overview of RegEx Characters

## 1. **Basic Characters**
   - **Literal characters**: Matches the exact characters in a string.
     - Example: `apple` matches the string `"apple"`.
   
## 2. **Metacharacters**
   These characters have a special meaning in regex:

   - **Dot (`.`)**: Matches any single character except newline (`\n`).
     - Example: `a.c` matches `"abc"`, `"axc"`, `"a-c"`, but not `"ac"` (since it needs a character between "a" and "c").
   
   - **Caret (`^`)**: Anchors the match to the start of the string.
     - Example: `^apple` matches `"apple pie"` but not `"I like apple"`.
   
   - **Dollar (`$`)**: Anchors the match to the end of the string.
     - Example: `apple$` matches `"I love apple"` but not `"apple pie"`.
   
   - **Square Brackets (`[]`)**: Denote a character class that matches any one of the characters inside the brackets.
     - Example: `[aeiou]` matches any vowel.
   
   - **Hyphen inside brackets (`-`)**: Defines a range of characters.
     - Example: `[a-z]` matches any lowercase letter.
   
   - **Caret inside brackets (`[^]`)**: Matches any character except those inside the brackets.
     - Example: `[^a-z]` matches any character except a lowercase letter.
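
The metacharacter examples above can be checked directly with `re.search()` and `re.findall()`. This is a small sketch using the same patterns; the test strings and variable names are illustrative.

```python
import re

# Dot: exactly one character is required between "a" and "c"
dot_hit = bool(re.search(r"a.c", "a-c"))    # True
dot_miss = bool(re.search(r"a.c", "ac"))    # False

# Anchors: ^ ties the match to the start, $ to the end
start_hit = bool(re.search(r"^apple", "apple pie"))      # True
start_miss = bool(re.search(r"^apple", "I like apple"))  # False
end_hit = bool(re.search(r"apple$", "I love apple"))     # True

# Character classes: a set, and a negated range
vowels = re.findall(r"[aeiou]", "banana")      # ['a', 'a', 'a']
non_lower = re.findall(r"[^a-z]", "abc-123")   # ['-', '1', '2', '3']

print(dot_hit, dot_miss, start_hit, start_miss, end_hit)
print(vowels, non_lower)
```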

## 3. **Quantifiers**
   These specify how many instances of the preceding element are required:

   - **Asterisk (`*`)**: Matches 0 or more occurrences of the preceding element.
     - Example: `a*` matches `""`, `"a"`, `"aa"`, etc.
   
   - **Plus (`+`)**: Matches 1 or more occurrences of the preceding element.
     - Example: `a+` matches `"a"`, `"aa"`, `"aaa"`, but not `""`.
   
   - **Question mark (`?`)**: Matches 0 or 1 occurrence of the preceding element (makes it optional).
     - Example: `a?` matches `""` and `"a"`.
   
   - **Braces (`{n,m}`)**: Matches between `n` and `m` occurrences of the preceding element.
     - Example: `a{2,4}` matches `"aa"`, `"aaa"`, and `"aaaa"` but not `"a"` or `"aaaaa"`.
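
The quantifier examples above describe which whole strings a pattern accepts. One way to check that in Python is `re.fullmatch()`, which only succeeds if the entire string matches. A short sketch, with an illustrative list of candidate strings:

```python
import re

candidates = ["", "a", "aa", "aaaaa"]

# For each pattern, collect the candidates that match in full
results = {}
for pattern in [r"a*", r"a+", r"a?", r"a{2,4}"]:
    results[pattern] = [s for s in candidates if re.fullmatch(pattern, s)]
    print(pattern, "->", results[pattern])

# a*      -> ['', 'a', 'aa', 'aaaaa']
# a+      -> ['a', 'aa', 'aaaaa']
# a?      -> ['', 'a']
# a{2,4}  -> ['aa']
```

With `re.search()` instead, `a{2,4}` would still find a match inside `"aaaaa"`, because a search only needs part of the string to match.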

## 4. **Character Classes**
   These represent common sets of characters:

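   - **`\d`**: Matches any digit. This is equivalent to `[0-9]` when the `re.ASCII` flag is set; by default, Python 3 string patterns also match other Unicode decimal digits.
   - **`\D`**: Matches any non-digit (the opposite of `\d`).
   - **`\w`**: Matches any word character (letters, digits, and underscores). This is equivalent to `[a-zA-Z0-9_]` when the `re.ASCII` flag is set; by default, Python 3 string patterns also match Unicode letters and digits.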
   - **`\W`**: Matches any non-word character (opposite of `\w`).
   - **`\s`**: Matches any whitespace character (spaces, tabs, newlines).
   - **`\S`**: Matches any non-whitespace character.
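   - **`\b`**: Matches a word boundary: the zero-width position between a word character and a non-word character, or at the start or end of the string next to a word character. It consumes no characters.
   - **`\B`**: Matches any position that is not a word boundary.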
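
A few of these shorthand classes in action, as a rough sketch. The sample strings are made up for illustration, and `\u00e9` is the letter "é", used here to show how `\w` behaves with and without `re.ASCII`.

```python
import re

phrase = "caf\u00e9 au lait"

# By default \w is Unicode-aware in Python 3, so the accented letter counts as a word character
words_unicode = re.findall(r"\w+", phrase)
# With re.ASCII, \w is limited to [a-zA-Z0-9_], so the word is cut before the accented letter
words_ascii = re.findall(r"\w+", phrase, flags=re.ASCII)

# \s+ treats any run of spaces, tabs, or newlines as one separator
fields = re.split(r"\s+", "a  b\tc\nd")

# \b keeps the match from firing inside longer words
whole_word = re.findall(r"\bcat\b", "cat concat cats")

print(words_unicode)  # three words, the first keeping its accented letter
print(words_ascii)    # ['caf', 'au', 'lait']
print(fields)         # ['a', 'b', 'c', 'd']
print(whole_word)     # ['cat']
```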

## 5. **Grouping and Capturing**
   - **Parentheses (`()`)**: Used to group parts of a regex pattern and capture them for later reference.
     - Example: `(abc)+` matches `"abc"`, `"abcabc"`, etc., and groups the `"abc"` part for later use.
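
Two details about capturing are worth seeing in code: a repeated group keeps only its last repetition, and a captured group can be referenced later in the same pattern with `\1`. The sketch below uses made-up sample strings.

```python
import re

# A quantified group captures only the last repetition
repeated = re.fullmatch(r"(abc)+", "abcabc")
print(repeated.group(0))  # abcabc
print(repeated.group(1))  # abc

# \1 refers back to whatever group 1 matched, here used to spot a doubled word
doubled = re.search(r"\b(\w+) \1\b", "this is is a test")
print(doubled.group(0))   # is is
print(doubled.group(1))   # is
```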

## 6. **Alternation**
   - **Pipe (`|`)**: Acts like an "OR" operator to match either of two patterns.
     - Example: `apple|banana` matches `"apple"` or `"banana"`.
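
Alternation applies to everything on either side of the pipe, so it usually needs a group to limit its scope. A minimal sketch follows; the `gray`/`grey` strings are illustrative and `(?:...)` is a non-capturing group.

```python
import re

text = "I like to eat apple and banana."
fruits = re.findall(r"apple|banana", text)
print(fruits)  # ['apple', 'banana']

# Grouped: "gr", then "a" or "e", then "y"
grouped = re.findall(r"gr(?:a|e)y", "gray grey")
print(grouped)  # ['gray', 'grey']

# Ungrouped: the alternatives are "gra" and "ey"
ungrouped = re.findall(r"gra|ey", "gray grey")
print(ungrouped)  # ['gra', 'ey']
```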

## 7. **Escape Sequences**
   - **Backslash (`\`)**: Escapes a special character to treat it as a literal.
     - Example: `\.` matches a literal dot `"."` (instead of any character).
   
   - **Escaping metacharacters**: You can use a backslash to escape metacharacters like `\^`, `\$`, `\*`, etc.
     - Example: `\*` matches a literal asterisk `"*"`.
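
When the text to match comes from a variable rather than being typed into the pattern, `re.escape()` can add the backslashes for you. The sketch below compares escaped and unescaped patterns; `user_text` is a hypothetical input.

```python
import re

# An unescaped dot is a wildcard; an escaped dot is a literal "."
wildcard = bool(re.search(r"3.14", "3x14"))      # True
literal_dot = bool(re.search(r"3\.14", "3x14"))  # False

# Hypothetical user input containing a metacharacter
user_text = "1+1"

# Unescaped, "+" is a quantifier, so the pattern means "one or more 1s, then a 1"
unescaped = re.search(user_text, "1+1=2")
# re.escape() makes every special character literal
escaped = re.search(re.escape(user_text), "1+1=2")

print(wildcard, literal_dot)
print(unescaped)        # None
print(escaped.group())  # 1+1
```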
   
## 8. **Lookahead and Lookbehind (Advanced)**
   - **Positive lookahead (`(?=...)`)**: Ensures a pattern is followed by another pattern.
     - Example: `\d(?=\D)` matches a digit that is followed by a non-digit.
   
   - **Negative lookahead (`(?!...)`)**: Ensures a pattern is not followed by another pattern.
     - Example: `\d(?!\d)` matches a digit not followed by another digit.
   
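   - **Positive lookbehind (`(?<=...)`)**: Ensures a pattern is preceded by another pattern. In Python's `re` module, the lookbehind pattern must have a fixed width.
     - Example: `(?<=@)\w+` matches a word after the "@" symbol in an email address.
   
   - **Negative lookbehind (`(?<!...)`)**: Ensures a pattern is not preceded by another pattern.
     - Example: `(?<!a)b` matches a `"b"` that is not preceded by `"a"`.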
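
Lookarounds check the surrounding text without including it in the match. The sketch below runs each kind on short sample strings; the strings, and the dollar-sign pattern in the last example, are made up for illustration.

```python
import re

# Positive lookahead: a digit followed by a non-digit (the final "4" has nothing after it)
followed_by_non_digit = re.findall(r"\d(?=\D)", "12a34")
print(followed_by_non_digit)  # ['2']

# Negative lookahead: a digit not followed by another digit (end of string counts)
not_followed_by_digit = re.findall(r"\d(?!\d)", "12a34")
print(not_followed_by_digit)  # ['2', '4']

# Positive lookbehind: the word right after "@"
after_at = re.findall(r"(?<=@)\w+", "test@example.com")
print(after_at)  # ['example']

# Negative lookbehind: whole numbers that are not preceded by "$"
no_dollar = re.findall(r"(?<!\$)\b\d+", "$100 or 200")
print(no_dollar)  # ['200']
```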

## 9. **Flags**
   Flags modify the behavior of the regex:

   - **`re.IGNORECASE` (or `re.I`)**: Makes the regex case-insensitive.
     - Example: `re.search(r"apple", "Apple", re.I)` will match `"Apple"`.
   
   - **`re.MULTILINE` (or `re.M`)**: Allows `^` and `$` to match the start and end of each line, not just the start and end of the whole string.
   
   - **`re.DOTALL` (or `re.S`)**: Allows the dot (`.`) to match newline characters.
   
   - **`re.VERBOSE` (or `re.X`)**: Allows you to write regex with comments and more readable formatting.
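
Here is a brief sketch of each flag. The sample strings are illustrative, and the verbose pattern is the phone number pattern from earlier, spread over several lines with comments.

```python
import re

# re.I: case-insensitive matching
ignore_case = re.search(r"apple", "Apple", re.I)
print(ignore_case.group())  # Apple

# re.M: ^ matches at the start of every line
lines = "one two\nthree four"
first_default = re.findall(r"^\w+", lines)          # ['one']
first_multiline = re.findall(r"^\w+", lines, re.M)  # ['one', 'three']
print(first_default, first_multiline)

# re.S: the dot also matches a newline
dot_default = re.search(r"a.b", "a\nb")       # None
dot_all = re.search(r"a.b", "a\nb", re.S)     # matches
print(dot_default, dot_all is not None)

# re.X: whitespace and comments inside the pattern are ignored
phone = re.compile(r"""
    (\d{3})   # area code
    -
    (\d{3})   # prefix
    -
    (\d{4})   # line number
""", re.VERBOSE)
parts = phone.search("My phone number is 123-456-7890.").groups()
print(parts)  # ('123', '456', '7890')
```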

### Recommendations

I prefer using online services to test my regex. Here is the one that I use:

[regexr.com](https://regexr.com/)

