# Regular Expressions (Regex) in Python

### Table of Contents

1. What is a regular expression? Why do we need it?
2. The `re` module in Python
3. Basic regex syntax (literals, metacharacters, character classes, quantifiers, anchors)
4. Common regex operations (search, match, fullmatch, findall, finditer, sub, split, compile)
5. Groups and capturing
6. Alternation (`|`)
7. Flags (like case-insensitive search)
8. Practical examples
9. Small exercises for you to try
10. Summary

## 1. What is a Regular Expression?

A **Regular Expression (regex)** is a special string pattern used to **search**, **match**, and **manipulate** text.

Think of regex as a **powerful search language** for text.  
Instead of just checking if `"abc"` is in a string, regex lets you do things like:

- Find all words that start with a capital letter.
- Check if a phone number is in the correct format.
- Replace multiple spaces with a single space.
- Extract email addresses from a big block of text.

*and many more...*

### Why is Regex Required?

Some common real-world uses:

- Validating user input (email, phone, password, postal code, etc.)  
- Cleaning messy text data (logs, scraped web data, etc.)  
- Extracting specific pieces of information from large text  
- Search-and-replace patterns in files or strings  

Without regex, you would have to write **a lot of custom code** to do these tasks.
With regex, you can often do it with **one short pattern**.

### A Quick Motivation Example (Without Regex)

Suppose we want to find all words ending with `ing` in a sentence:

In [1]:
text = "I am learning Python and enjoying writing interesting things."

words = text.split()
ing_words = [w for w in words if w.endswith("ing")]

print(ing_words)

['learning', 'enjoying', 'writing', 'interesting']


This works, but:

- What if we want words ending with **`ing` or `ed` or `s`**?
- What if punctuation is attached (like `things.`)?
- What if we want only words with **3–10 letters** that end with `ing`?

This is where **regex** becomes very helpful.
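
As a minimal sketch of how those three questions could be handled with regex (the patterns below are illustrative choices, not the only way to write them), `\b` marks a word boundary so attached punctuation stays out of the match. The syntax used here is explained in the following sections.

```python
import re

text = "I am learning Python and enjoying writing interesting things."

# Words ending in "ing"; \b (word boundary) keeps punctuation out of the match
ing_words = re.findall(r"\b\w+ing\b", text)

# Words ending in "ing", "ed", or "s"
suffix_words = re.findall(r"\b\w+(?:ing|ed|s)\b", text)

# Words of 3 to 10 letters in total that end in "ing" (0 to 7 letters + "ing")
short_ing_words = re.findall(r"\b[A-Za-z]{0,7}ing\b", text)

print(ing_words)        # ['learning', 'enjoying', 'writing', 'interesting']
print(suffix_words)     # ['learning', 'enjoying', 'writing', 'interesting', 'things']
print(short_ing_words)  # ['learning', 'enjoying', 'writing']
```

Each variation changes only the pattern; the surrounding code stays the same.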

## 2. The `re` Module in Python

Python provides regex functionality through the built-in **`re`** module.

To use regex in Python, you must first **import** the module:

In [2]:
import re

print("re module imported!")

re module imported!


The `re` module provides many useful functions.  
The most commonly used ones are:

- `re.search(pattern, string)`
- `re.match(pattern, string)`
- `re.fullmatch(pattern, string)`
- `re.findall(pattern, string)`
- `re.finditer(pattern, string)`
- `re.sub(pattern, repl, string)`
- `re.split(pattern, string)`
- `re.compile(pattern)`

We will explore each of these step by step.

## 3. Basic Regex Syntax

Before using the functions, we must understand how to **write regex patterns**.

### 3.1 Literal Characters

The simplest pattern is just regular text.

- Pattern: `"cat"`
- Meaning: find `"cat"` exactly in the string.

Example:

In [5]:
import re

text = "The cat sat on the catalog c@t."

pattern = "cat"
matches = re.findall(pattern, text)

print("Matches:", matches)

Matches: ['cat', 'cat']


Notice that it matched:

- `"cat"`
- The `"cat"` inside `"catalog"`

because both contain the substring `"cat"`.

### 3.2 Special Characters (Metacharacters)

Regex has **special characters** that have special meanings.  
These are some common ones:

| Symbol | Meaning                            | Example pattern |
|--------|------------------------------------|-----------------|
| `.`    | Any single character (except `\n`) | `a.c` matches `abc`, `axc` |
| `^`    | Start of string                    | `^Hello`        |
| `$`    | End of string                      | `world$`        |
| `*`    | 0 or more repetitions             | `ab*`           |
| `+`    | 1 or more repetitions             | `ab+`           |
| `?`    | 0 or 1 repetition (optional)      | `ab?`           |
| `[]`   | Character class (any of these)    | `[abc]`         |
| `\|`    | OR (alternation)                  | `cat\|dog`      |
| `()`   | Grouping                           | `(ab)+`         |
| `\`   | Escape special characters         | `\.` matches a literal `.` |

We will see examples for each of these soon.
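
To illustrate the escape character from the last table row, here is a small sketch comparing an unescaped `.` with an escaped `\.`. The sample text is made up for illustration; `re.escape` is a standard `re` helper that escapes metacharacters in a piece of text for you.

```python
import re

text = "Versions: 3.14, 3x14, 3-14"

# Unescaped dot: matches ANY character between "3" and "14"
any_char = re.findall(r"3.14", text)

# Escaped dot: matches only a literal "."
literal_dot = re.findall(r"3\.14", text)

# re.escape() builds the escaped pattern from plain text
escaped = re.escape("3.14")
from_escape = re.findall(escaped, text)

print(any_char)     # ['3.14', '3x14', '3-14']
print(literal_dot)  # ['3.14']
print(from_escape)  # ['3.14']
```

`re.escape` is handy when the text to search for comes from user input and should be treated literally.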

### 3.3 Character Classes: `[]`

A **character class** lets you match **one character from a set**.

- `[abc]` → matches `'a'` or `'b'` or `'c'`
- `[0-9]` → any digit
- `[a-z]` → any lowercase letter
- `[A-Z]` → any uppercase letter
- `[a-zA-Z]` → any letter
- `[^0-9]` → any **non-digit** (`^` at the start means "not" inside `[]`)

Example:

In [6]:
import re

text = "Room numbers: A1, B2, C3, D4."

pattern = r"[A-Z][0-9]"
matches = re.findall(pattern, text)

print("Matches:", matches)

Matches: ['A1', 'B2', 'C3', 'D4']


We used a **raw string** (`r"..."`) for the pattern.  
Raw strings are recommended for regex patterns because Python does not process backslash escapes inside them, so sequences such as `\d` or `\b` reach the regex engine unchanged.
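
The difference is easiest to see with `\b`, which means "backspace" in a normal Python string but "word boundary" in a regex. A minimal sketch (the sample text is an assumption for illustration):

```python
import re

# In a normal string, "\b" is a single backspace character
normal_len = len("\b")
# In a raw string, r"\b" is two characters: a backslash and the letter b
raw_len = len(r"\b")

text = "cat catalog"

# Without r"...": searches for backspace + "cat" + backspace, which is not in the text
without_raw = re.findall("\bcat\b", text)

# With r"...": searches for "cat" as a whole word
with_raw = re.findall(r"\bcat\b", text)

print(normal_len, raw_len)  # 1 2
print(without_raw)          # []
print(with_raw)             # ['cat']
```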

### 3.4 Predefined Character Classes

Some very common sets are built-in:

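- `\d` → digit (same as `[0-9]` for ASCII text; in Python 3 `str` patterns it also matches other Unicode digits unless `re.ASCII` is set)
- `\D` → non-digit
- `\w` → word character (letters, digits, underscore); equal to `[A-Za-z0-9_]` only with `re.ASCII`, otherwise it also matches Unicode letters and digits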
- `\W` → non-word character
- `\s` → whitespace (space, tab, newline)
- `\S` → non-whitespace

Example:

In [10]:
import re

text = "ID: A12, Code: B_34, Zip: 560001."

pattern = r"\d+"
matches = re.findall(pattern, text)

print("Digit runs:", matches)

Digit runs: ['12', '34', '560001']
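
For comparison, a short sketch of the other predefined classes on the same text, plus the effect of `re.ASCII` on `\w`. The string `"caf\u00e9"` ("café") is an assumed example, not part of the original notebook.

```python
import re

text = "ID: A12, Code: B_34, Zip: 560001."

# \w+ : runs of letters, digits, and underscores
words = re.findall(r"\w+", text)

# \s : single whitespace characters (only spaces in this text)
spaces = re.findall(r"\s", text)

# By default \w also matches non-ASCII letters; re.ASCII restricts it to [A-Za-z0-9_]
unicode_words = re.findall(r"\w+", "caf\u00e9")
ascii_words = re.findall(r"\w+", "caf\u00e9", flags=re.ASCII)

print(words)          # ['ID', 'A12', 'Code', 'B_34', 'Zip', '560001']
print(len(spaces))    # 5
print(unicode_words)  # ['café']
print(ascii_words)    # ['caf']
```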


### 3.5 Quantifiers: `*`, `+`, `?`, `{m,n}`

Quantifiers tell **how many times** a part of the pattern can repeat.

- `*` → 0 or more times  
- `+` → 1 or more times  
- `?` → 0 or 1 time (optional)  
- `{m}` → exactly `m` times  
- `{m,}` → at least `m` times  
- `{m,n}` → between `m` and `n` times (inclusive)

Example:

In [32]:
import re

text = "Helloooo! Heellooo! Helo! Heo! Hlo!"

patterns = [
    (r"Hel*o+", "Hel*o+"),
    (r"He?lo+", "He?lo+"),
    (r"He{2}l{1,}o+", "He{2}l{1,}o+"),
    (r"H(el)*o+", "H(el)*o+")
]

for pattern, label in patterns:
    matches = re.findall(pattern, text)
    print(f"Pattern {label!r} ->", matches)

Pattern 'Hel*o+' -> ['Helloooo', 'Helo', 'Heo']
Pattern 'He?lo+' -> ['Helo', 'Hlo']
Pattern 'He{2}l{1,}o+' -> ['Heellooo']
Pattern 'H(el)*o+' -> ['el']
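
The last result may look surprising: when a pattern contains a capturing group, `re.findall` returns the text captured by the group (here `'el'`) instead of the whole match. A minimal sketch of one way around this, using a non-capturing group `(?:...)`:

```python
import re

text = "Helloooo! Heellooo! Helo! Heo! Hlo!"

# Capturing group: findall returns what the group captured
captured = re.findall(r"H(el)*o+", text)

# Non-capturing group: findall returns the whole match
whole = re.findall(r"H(?:el)*o+", text)

print(captured)  # ['el']
print(whole)     # ['Helo']
```

Only `Helo` matches, because `(el)*` repeats the pair `el` as a unit, so `Hello...` (with a double `l`) does not fit the pattern.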


### 3.6 Anchors: `^` and `$`

Anchors do **not** match characters.  
They match **positions** in the string:

- `^` → start of the string (or line, with certain flags)
- `$` → end of the string, or just before a single trailing newline (or end of each line, with certain flags)

Example:

In [16]:
import re

text = "Hello World\nHello Python"

pattern_start_hello = r"^Hello"
pattern_end_python = r"Python$"

print("Match at start:", re.findall(pattern_start_hello, text))
print("Match at end:", re.findall(pattern_end_python, text))

Match at start: ['Hello']
Match at end: ['Python']


Notice that:

- By default, `^Hello` matches only at the very start of the string, so the second `Hello` (after the newline) is not found.
- `Python$` matches only at the very end of the string.

We will later see how flags can make `^` and `$` work **line by line**.

## 4. Common Regex Operations with `re`

Now let's explore the main functions in the `re` module and **what they do**.

### 4.1 `re.search(pattern, string)`

- Searches **anywhere** in the string for the first match.
- Returns a **match object** if found, else `None`.

In [17]:
import re

text = "My number is 12345."

match = re.search(r"\d+", text)

if match:
    print("Matched text:", match.group())
    print("Start index:", match.start())
    print("End index:", match.end())
else:
    print("No match found")

Matched text: 12345
Start index: 13
End index: 18


### 4.2 `re.match(pattern, string)` vs `re.search`

- `re.match` checks only **at the beginning** of the string.
- `re.search` checks **anywhere** in the string.

Example:

In [19]:
import re

text = "Order number: 5678"

m1 = re.match(r"\d+", text)
m2 = re.search(r"\d+", text)

print("match:", m1)
print("search:", m2.group() if m2 else None)

match: None
search: 5678


You can see that `re.match` returned `None` because the text does **not start** with digits,  
but `re.search` found digits later in the string.

### 4.3 `re.fullmatch(pattern, string)`

- Checks if the **entire string** matches the pattern.
- Useful for validation (e.g., is this string a valid phone number?).

Example:

In [27]:
import re

def is_valid_id(value):
    # Exactly 2 uppercase letters followed by 3 digits
    pattern = r"[A-Z]{2}\d{3}"
    return re.fullmatch(pattern, value) is not None

tests = ["AB123", "A123", "AB1234", "XY999"]

for t in tests:
    print(t, "->", is_valid_id(t))

AB123 -> True
A123 -> False
AB1234 -> False
XY999 -> True


### 4.4 `re.findall(pattern, string)`

- Returns a **list of all non-overlapping matches** of the pattern in the string.

Example:

In [37]:
import re

text = "Emails: a@example.com, b.test@domain.org, invalid@mail.com"

pattern = r"[\w\.]+@[\w\.]+"
emails = re.findall(pattern, text)

print("Extracted emails:", emails)

Extracted emails: ['a@example.com', 'b.test@domain.org', 'invalid@mail.com']
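
One detail worth knowing: if the pattern contains capturing groups, `re.findall` returns the groups rather than the full matches, and with two or more groups each result is a tuple. A small sketch, using a shortened version of the text above:

```python
import re

text = "Emails: a@example.com, b.test@domain.org"

# Two capturing groups: each result is a (user, domain) tuple
pairs = re.findall(r"([\w.]+)@([\w.]+)", text)

for user, domain in pairs:
    print(user, "at", domain)
```

Here `pairs` is `[('a', 'example.com'), ('b.test', 'domain.org')]`.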


### 4.5 `re.finditer(pattern, string)`

- Returns an **iterator** of match objects (instead of just strings).
- Useful when you want **positions** or other match details.

Example:

In [38]:
import re

text = "abc 123 def 456"

for match in re.finditer(r"\d+", text):
    print("Matched:", match.group(), "| Start:", match.start(), "| End:", match.end())

Matched: 123 | Start: 4 | End: 7
Matched: 456 | Start: 12 | End: 15


### 4.6 `re.sub(pattern, repl, string)`

- Used to **replace** all matches of a pattern with some replacement text.

Example: replace multiple spaces with a single space.

In [40]:
import re

text = "  This   sentence    has   extra   spaces.    "

cleaned = re.sub(r"\s+", " ", text)
print(cleaned)

 This sentence has extra spaces. 
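
Notice that one space remains at each end of the result. The sketch below shows one way to remove it with `str.strip()`, and two other features of `re.sub`: a function as the replacement, and the `count` argument. The helper name `double` and the fruit sentence are made up for illustration.

```python
import re

text = "  This   sentence    has   extra   spaces.    "

# Collapse whitespace runs, then strip the single spaces left at both ends
cleaned = re.sub(r"\s+", " ", text).strip()

# repl can be a function that receives each match object and returns the replacement
def double(match):
    return str(int(match.group()) * 2)

doubled = re.sub(r"\d+", double, "3 apples and 10 pears")

# count limits how many replacements are made
first_only = re.sub(r"\d+", "N", "3 apples and 10 pears", count=1)

print(repr(cleaned))  # 'This sentence has extra spaces.'
print(doubled)        # 6 apples and 20 pears
print(first_only)     # N apples and 10 pears
```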


### 4.7 `re.split(pattern, string)`

- Similar to `str.split`, but the separator is a **regex pattern**.

Example: split on **comma or semicolon or whitespace**.

In [41]:
import re

text = "apple, banana;  orange   mango"

parts = re.split(r"[;,\s]+", text)
print(parts)

['apple', 'banana', 'orange', 'mango']
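
A few behaviours of `re.split` are easy to trip over, so here is a hedged sketch with made-up inputs: separators at the edges produce empty strings, `maxsplit` limits the number of splits, and a capturing group keeps the separators in the result.

```python
import re

# A separator at the start or end of the text produces empty strings
parts = re.split(r"[;,\s]+", " apple, banana; ")
non_empty = [p for p in parts if p]

# maxsplit stops after the given number of splits
limited = re.split(r"[;,\s]+", "apple, banana; orange mango", maxsplit=2)

# A capturing group in the pattern keeps the separators in the output
with_seps = re.split(r"([;,])", "a,b;c")

print(parts)      # ['', 'apple', 'banana', '']
print(non_empty)  # ['apple', 'banana']
print(limited)    # ['apple', 'banana', 'orange mango']
print(with_seps)  # ['a', ',', 'b', ';', 'c']
```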


### 4.8 `re.compile(pattern)`

- Compiles a regex pattern into a **pattern object**.
- Useful when you want to use the **same pattern many times**.

Example:

In [42]:
import re

pattern = re.compile(r"[A-Z]{2}\d{3}")

tests = ["AB123", "XY999", "AA12", "ABC123"]

for t in tests:
    if pattern.fullmatch(t):
        print(t, "-> valid")
    else:
        print(t, "-> invalid")

AB123 -> valid
XY999 -> valid
AA12 -> invalid
ABC123 -> invalid
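
A compiled pattern object offers the same operations as the module-level functions (`search`, `findall`, `sub`, and so on) as methods, and any flags are given once at compile time. A minimal sketch; the name `word` and the sample strings are assumptions for illustration:

```python
import re

# Flags are passed once, when the pattern is compiled
word = re.compile(r"[a-z]+", flags=re.IGNORECASE)

found = word.findall("Hi there Bob")
replaced = word.sub("X", "a1b2")
first = word.search("123 abc")

print(found)                         # ['Hi', 'there', 'Bob']
print(replaced)                      # X1X2
print(first.group(), first.start())  # abc 4
```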


## 5. Groups and Capturing

**Groups** allow you to capture parts of the matched text.

- Use parentheses `()` to create a group.
- Use `.group(n)` to access group `n` (`0` is the whole match).

Example: extract **area code** and **number** from a phone string.

In [43]:
import re

text = "Call me at (080) 12345678 tomorrow."

pattern = r"\((\d{3})\)\s*(\d+)"
match = re.search(pattern, text)

if match:
    print("Full match:", match.group(0))
    print("Area code:", match.group(1))
    print("Number:", match.group(2))

Full match: (080) 12345678
Area code: 080
Number: 12345678


You can also use **named groups**:

- Syntax: `(?P<name>...)`
- Access: `match.group('name')`

In [44]:
import re

text = "Order ID: AB-12345"

pattern = r"Order ID: (?P<prefix>[A-Z]{2})-(?P<number>\d+)"
match = re.search(pattern, text)

if match:
    print("Prefix:", match.group('prefix'))
    print("Number:", match.group('number'))

Prefix: AB
Number: 12345
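
Match objects can also return all groups at once, and named groups can be reused in a replacement string with `\g<name>`. A short sketch building on the order ID example (the swapped output format is invented purely for illustration):

```python
import re

text = "Order ID: AB-12345"
pattern = r"Order ID: (?P<prefix>[A-Z]{2})-(?P<number>\d+)"
match = re.search(pattern, text)

# All groups as a tuple, in order
all_groups = match.groups()

# Named groups as a dictionary
by_name = match.groupdict()

# In a replacement string, \g<name> inserts the text of a named group
swapped = re.sub(pattern, r"Order ID: \g<number>-\g<prefix>", text)

print(all_groups)  # ('AB', '12345')
print(by_name)     # {'prefix': 'AB', 'number': '12345'}
print(swapped)     # Order ID: 12345-AB
```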


## 6. Alternation (`|`) – OR

The `|` symbol means **OR** between patterns.

- `cat|dog` matches either `"cat"` or `"dog"`.

Example:

In [45]:
import re

text = "I have a cat, a dog, and a bird."

pattern = r"cat|dog"
matches = re.findall(pattern, text)

print("Matches:", matches)

Matches: ['cat', 'dog']
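
Two details of `|` are worth a quick sketch (the sample strings are assumptions): a group limits how much of the pattern the alternation covers, and alternatives are tried from left to right, so the first one that matches wins.

```python
import re

# The group limits the alternation to the single letter: "gray" or "grey"
colors = re.findall(r"gr(?:a|e)y", "gray grey groy")

# Alternatives are tried left to right; the first that matches is used
short_first = re.findall(r"cat|catalog", "catalog")
long_first = re.findall(r"catalog|cat", "catalog")

print(colors)       # ['gray', 'grey']
print(short_first)  # ['cat']
print(long_first)   # ['catalog']
```

When one alternative is a prefix of another, putting the longer one first is usually what you want.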


## 7. Flags (Modifiers)

Flags change **how** the pattern is matched.

Some common flags:

- `re.IGNORECASE` or `re.I` → case-insensitive matching  
- `re.MULTILINE` or `re.M` → `^` and `$` match at the start/end of each line  
- `re.DOTALL` or `re.S` → `.` matches newline as well

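You can pass flags with the `flags=` keyword argument of the `re` functions, or to `re.compile`. Prefer the keyword form: in `re.sub` and `re.split`, the positional argument after the string is `count` or `maxsplit`, not `flags`.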

### 7.1 Case-insensitive search

In [46]:
import re

text = "Python is FUN. I love python."

pattern = r"python"

print("Case-sensitive:", re.findall(pattern, text))
print("Case-insensitive:", re.findall(pattern, text, flags=re.IGNORECASE))

Case-sensitive: ['python']
Case-insensitive: ['Python', 'python']


### 7.2 Multiline flag

With `re.MULTILINE`, `^` and `$` match at the start and end of **each line**, not just of the whole string.

In [47]:
import re

text = """first line
second line
third line"""

pattern = r"^\w+"
print("Without MULTILINE:", re.findall(pattern, text))
print("With MULTILINE:", re.findall(pattern, text, flags=re.MULTILINE))

Without MULTILINE: ['first']
With MULTILINE: ['first', 'second', 'third']
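
The `re.DOTALL` flag from the list at the start of this section works the same way, and several flags can be combined with the `|` operator. A minimal sketch with made-up sample strings:

```python
import re

text = "start\nEND"

# By default "." does not match a newline
no_dotall = re.findall(r"start.END", text)

# With DOTALL, "." matches the newline too
with_dotall = re.findall(r"start.END", text, flags=re.DOTALL)

# Combine flags with | : case-insensitive AND line-by-line anchors
lines = "Hello\nhello\nHELLO"
combined = re.findall(r"^hello$", lines, flags=re.IGNORECASE | re.MULTILINE)

print(no_dotall)    # []
print(with_dotall)  # ['start\nEND']
print(combined)     # ['Hello', 'hello', 'HELLO']
```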


## 8. Practical Examples

Let's now solve a few **realistic** tasks with regex.

### 8.1 Extract all URLs from a text

In [48]:
import re

text = """Here are some links:
https://www.example.com
http://test.org
Visit also www.python.org for more info.
"""

pattern = r"(https?://[\w./-]+|www\.[\w./-]+)"
urls = re.findall(pattern, text)

print("Found URLs:")
for url in urls:
    print("-", url)

Found URLs:
- https://www.example.com
- http://test.org
- www.python.org


### 8.2 Check if a string is a **simple** valid email

> Note: Real email validation can be much more complex.  
> Here we use a **simplified** pattern suitable for learning.

In [49]:
import re

def is_valid_email(email: str) -> bool:
    # fullmatch already requires the whole string to match, so ^ and $ are not needed
    pattern = r"[\w.]+@[\w.]+\.[a-zA-Z]{2,}"
    return re.fullmatch(pattern, email) is not None

tests = [
    "user@example.com",
    "a.b@test.co",
    "invalid@mail",
    "no_at_symbol.com",
]

for t in tests:
    print(f"{t:20} -> {is_valid_email(t)}")

user@example.com     -> True
a.b@test.co          -> True
invalid@mail         -> False
no_at_symbol.com     -> False


### 8.3 Clean up messy spacing and punctuation

In [50]:
import re

text = "Hello!!!   This   is   a   test...   Really??"

# 1. Replace multiple spaces with single space
step1 = re.sub(r"\s+", " ", text)

# 2. Replace multiple exclamation/question marks with one
step2 = re.sub(r"([!?]){2,}", r"\1", step1)

print("Original:", text)
print("Cleaned :", step2)

Original: Hello!!!   This   is   a   test...   Really??
Cleaned : Hello! This is a test... Really?


## 9. Small Exercises

Try these exercises by writing regex patterns yourself.  
You can create new cells below each exercise and experiment.

### Exercise 1 – Find all 4-digit numbers

From the text below, extract all **4-digit numbers**.

- Hint: use `\d` and `{4}`, and think about word boundaries (`\b`) so that longer numbers such as `12345` are not partially matched.

In [51]:
import re

text = "Years: 1990, 95, 2001, 202, 2024, 12345"

# Numbers with exactly 4 digits; \b stops "12345" from matching as "1234"
pattern = r"\b\d{4}\b"

matches = re.findall(pattern, text)
print("Matches:", matches)

Matches: ['1990', '2001', '2024']


### Exercise 2 – Extract hashtags

From the text, extract all **hashtags** (words starting with `#` and followed by letters or digits).

- Hint: `#` is an ordinary character in a regex, and `\w` matches letters, digits, and underscores.

In [52]:
import re

text = "Loving #Python, #100DaysOfCode and #regex!"

# TODO: pattern for hashtags
pattern = r"#\w+"

matches = re.findall(pattern, text)
print("Matches:", matches)

Matches: ['#Python', '#100DaysOfCode', '#regex']


### Exercise 3 – Validate a simple password

Write a function that returns `True` if a password:

- has at least 8 characters
- contains at least one digit
- contains at least one uppercase letter

(You can solve this with multiple regex checks, or a single more advanced one if you want a challenge.)

In [53]:
import re

def is_strong_password(pw: str) -> bool:
    # TODO: implement using regex
    has_min_length = len(pw) >= 8
    has_digit = re.search(r"\d", pw) is not None
    has_upper = re.search(r"[A-Z]", pw) is not None
    return has_min_length and has_digit and has_upper

tests = ["abc", "Password", "Pass1234", "weakpass1", "STRONG123"]

for t in tests:
    print(f"{t:12} -> {is_strong_password(t)}")

abc          -> False
Password     -> False
Pass1234     -> True
weakpass1    -> False
STRONG123    -> True
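
For the "single more advanced regex" challenge, one possible approach is to use lookaheads `(?=...)`, which check a condition without consuming any characters. This is only a sketch: the names `STRONG_PASSWORD` and `is_strong_password_single` are hypothetical, and because `.` does not match a newline, passwords containing newlines are rejected.

```python
import re

# (?=.*\d)     lookahead: a digit appears somewhere in the string
# (?=.*[A-Z])  lookahead: an uppercase letter appears somewhere in the string
# .{8,}        at least 8 characters in total
STRONG_PASSWORD = re.compile(r"(?=.*\d)(?=.*[A-Z]).{8,}")

def is_strong_password_single(pw: str) -> bool:
    """Return True if pw passes all three checks, using one pattern."""
    return STRONG_PASSWORD.fullmatch(pw) is not None

tests = ["abc", "Password", "Pass1234", "weakpass1", "STRONG123"]
results = {t: is_strong_password_single(t) for t in tests}

for t, ok in results.items():
    print(f"{t:12} -> {ok}")
```

The results agree with the multi-check version above; which form is easier to read and maintain is a matter of taste.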


## 10. Summary

In this notebook, we learned:

- What regular expressions (regex) are and **why** they are useful.
- How to use Python's built-in **`re` module**.
- Basic regex syntax: literals, character classes, quantifiers, anchors.
- Common operations:
  - `search`, `match`, `fullmatch`
  - `findall`, `finditer`
  - `sub`, `split`
  - `compile`
- How to use **groups**, **alternation**, and **flags**.
- Several practical examples and some exercises to practice.

Regex can feel strange at first, but it becomes very powerful once you practice.  
You don't have to memorize everything; you can always look up patterns when you need them (try Google or ChatGPT).

*Tip:* As you practice, try to **describe in plain English** what a pattern does.  
If you can explain it, you truly understand it.