## Mastering Regex: The Hidden Superpower of Text Processing

##### *Introduction*

Regex (Regular Expressions) is one of those tools that seems complex at first, but once you get it, you unlock a whole new level of text processing. I started learning Regex because I was curious about how chatbots and NLP models work. And trust me, it’s a game-changer! If you’ve ever struggled with searching, extracting, or validating text, Regex is your best friend. Let’s dive into what I learned today.

##### *Understanding Strings in Python*

- Raw String
- Bytes Literal
- Unicode String
- Formatted String

Raw String: Use raw strings when you want to avoid escaping backslashes, particularly in regular expressions or Windows file paths. In a raw string, a backslash (\\) is treated as a literal character rather than the start of an escape sequence, which makes raw strings especially useful for defining regex patterns.

Example:

In [1]:
# Creating a raw string
raw_string = r"C:\Users\Name\Documents"
print("Raw String : ", raw_string)

Raw String :  C:\Users\Name\Documents
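
To see why raw strings matter for regex in particular, here is a small sketch comparing the same pattern written with and without the r prefix. The sample text is made up for illustration.

```python
import re

# "\b" in a normal string is the backspace character (one character).
# r"\b" is a backslash followed by 'b' (two characters), which the
# regex engine interprets as a word boundary.
normal_pattern = "\bcat\b"
raw_pattern = r"\bcat\b"

text = "cat concat cat"  # sample text, chosen for illustration

normal_matches = re.findall(normal_pattern, text)
raw_matches = re.findall(raw_pattern, text)

print(len("\b"), len(r"\b"))  # 1 2
print(normal_matches)         # []
print(raw_matches)            # ['cat', 'cat']
```

The normal string finds nothing because it is looking for literal backspace characters around "cat".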


Bytes Literal: Use bytes literals to handle binary data, such as images, audio files, and other non-text data. A bytes object is an immutable sequence of integers in the range 0 to 255, each representing one raw byte value. Bytes literals are particularly useful when working with binary files, network communications, or when you need to encode or decode data.

In [2]:
# Creating a bytes literal
bytes_string = b"Hello, World!"

# Accessing individual bytes
print("Bytes String       : ", bytes_string) 
print("ASCII value of 'H' : ", bytes_string[0]) 

# Converting bytes to a string
decoded_string = bytes_string.decode('utf-8')
print("Normal String      : ", decoded_string)  # Output: Hello, World!

Bytes String       :  b'Hello, World!'
ASCII value of 'H' :  72
Normal String      :  Hello, World!
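
As a rough sketch of the encode/decode round trip, the example below uses a word with one non-ASCII character, and then runs a regex over bytes. The sample word and the payload are assumptions made up for this illustration.

```python
import re

# Encoding turns text into bytes; decoding turns bytes back into text.
word = "caf\u00e9"  # "café": sample word with one non-ASCII character
encoded = word.encode("utf-8")

print(encoded)                  # b'caf\xc3\xa9'
print(len(word), len(encoded))  # 4 5

# A bytes subject needs a bytes pattern (note the rb prefix).
payload = b"id=42;retries=7"  # hypothetical binary payload
numbers = re.findall(rb"\d+", payload)
print(numbers)                  # [b'42', b'7']
```

The accented letter takes two bytes in UTF-8, which is why the byte length is larger than the character count.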


Unicode String: Use Unicode strings when you need to represent text that includes characters from various languages and symbol sets. Unicode strings are prefixed with a u or U in Python 2, but in Python 3, all string literals are Unicode by default. This means that Unicode strings can handle a wide range of characters, including letters, numbers, punctuation, and special symbols from different languages. Unicode is essential for ensuring that text is displayed correctly across different platforms and systems, making it particularly useful for internationalization and localization of applications.

In [3]:
# Creating a unicode string
unicode_string = u"Hello, 🌍!"  # Contains a Unicode emoji; the u prefix is optional in Python 3
print("Unicode String : ", unicode_string)


Unicode String :  Hello, 🌍!


Formatted String: Use formatted strings when you want to easily include variables or expressions in a string. Formatted strings allow you to place variables inside curly braces {}. This makes it simple to create dynamic messages without complicated formatting. They are great for generating text that includes variable values.

In [4]:
# Creating a formatted string
name = "Anto"
age = 23
formatted_string = f"My name is {name} and I am {age} years old."
print("Formatted String : ", formatted_string)

Formatted String :  My name is Anto and I am 23 years old.
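
Formatted strings and raw strings can be combined with the rf prefix to build a regex pattern from a variable. The sketch below assumes a hypothetical user-supplied keyword and shows why passing it through re.escape() is usually a good idea.

```python
import re

keyword = "3.5"  # hypothetical user-supplied search term
text = "Python 3.5 and Python 315"

# Interpolated as-is, the '.' in the keyword acts as a regex wildcard.
unescaped_pattern = rf"Python {keyword}"
# re.escape() backslash-escapes metacharacters so they match literally.
escaped_pattern = rf"Python {re.escape(keyword)}"

unescaped_matches = re.findall(unescaped_pattern, text)
escaped_matches = re.findall(escaped_pattern, text)

print(unescaped_matches)  # ['Python 3.5', 'Python 315']
print(escaped_matches)    # ['Python 3.5']
```

One thing to watch for: inside an f-string, literal braces must be doubled, so a quantifier such as {4} is written {{4}}.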


##### *Regex Basics: The Building Blocks*

- Metacharacters & Escape Sequences
- Regex Operations in Python (re Module)
- Quantifiers: Controlling Repetitions
- Grouping & Capturing: The Power of ()
- Greedy vs. Lazy Matching


Metacharacters & Escape Sequences:

. (Dot):

Description --> The dot matches any single character except for a newline character (\n). It is a wildcard that can represent any character in a string.

Example: In the pattern **c.t**, the dot can match any character between 'c' and 't', such as 'a', 'b', or 'x'. So it would match "cat", "cut", "c1t", etc.

In [5]:
import re

text = "cat bat rat"
pattern = r"c.t"  # Matches any three-character string starting with 'c' and ending with 't'

matches = re.findall(pattern, text)
print("Matched String : ", matches)

Matched String :  ['cat']
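
Two details about the dot are easy to miss: it skips newlines unless the re.DOTALL flag is passed, and a literal dot has to be escaped as \. to match only a period. A minimal sketch with made-up sample text:

```python
import re

text = "c\nt cat c.t"  # contains a newline between the first 'c' and 't'

default_matches = re.findall(r"c.t", text)
dotall_matches = re.findall(r"c.t", text, re.DOTALL)
literal_matches = re.findall(r"c\.t", text)

print(default_matches)  # ['cat', 'c.t']
print(dotall_matches)   # ['c\nt', 'cat', 'c.t']
print(literal_matches)  # ['c.t']
```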


^ (Caret):

Description --> The caret matches the start of a string. It asserts that the following pattern must occur at the beginning of the string.

Example: In the pattern ^Hello, it will match any string that starts with "Hello", such as "Hello, World!" or "Hello there!". It will not match "Say Hello".

In [6]:
import re

text = "Hello, World!"
pattern = r"^Hello"  # Matches if the string starts with 'Hello'

match = re.search(pattern, text)  # search() scans the whole string; '^' pins the match to the start
print("Matched String : ", match)


Matched String :  <re.Match object; span=(0, 5), match='Hello'>
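
By default the caret only matches at the very start of the string. With the re.MULTILINE flag it also matches at the start of every line. Here is a small sketch using a made-up three-line string:

```python
import re

text = "Hello, World!\nHello again\nSay Hello"

single_line = re.findall(r"^Hello", text)
multi_line = re.findall(r"^Hello", text, re.MULTILINE)

print(single_line)  # ['Hello']
print(multi_line)   # ['Hello', 'Hello']
```

The third line does not count in either case, because "Hello" is not at the start of that line.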


$ (Dollar): 

Description --> The dollar sign matches the end of a string. It asserts that the preceding pattern must occur at the end of the string.

Example: In the pattern World!$, it will match any string that ends with "World!", such as "Hello, World!" or "Goodbye, World!". It will not match "World! is great".

In [7]:
import re

text = "Hello, World!"
pattern = r"World!$"  # Matches if the string ends with 'World!'

match = re.search(pattern, text)
print("Matched String : ", match) 

Matched String :  <re.Match object; span=(7, 13), match='World!'>
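
A short sketch of how the dollar sign behaves at the edges, and how it combines with the caret to check a whole string. The variable names are just for illustration.

```python
import re

# '$' matches at the very end of the string, and also just before a
# single trailing newline.
ends_plain = bool(re.search(r"World!$", "Hello, World!"))      # True
ends_newline = bool(re.search(r"World!$", "Hello, World!\n"))  # True
ends_middle = bool(re.search(r"World!$", "World! is great"))   # False

# '^' and '$' together require the whole string to fit the pattern.
exact = bool(re.search(r"^Hello$", "Hello"))            # True
partial = bool(re.search(r"^Hello$", "Hello, World!"))  # False

print(ends_plain, ends_newline, ends_middle)  # True True False
print(exact, partial)                         # True False
```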


[] (Character Set):

Description --> The character set matches any single character that is included within the brackets. You can specify a range of characters or individual characters.

Example: In the pattern [abc]pple, it will match "apple" (because it starts with 'a'), "bpple" (because it starts with 'b'), or "cpple" (because it starts with 'c'). It will not match "dpple".

In [8]:
import re

text = "apple banana cherry"
pattern = r"[abc]pple"  # Matches 'apple', 'bpple', or 'cpple'

matches = re.findall(pattern, text)
print("Matched String : ", matches) 

Matched String :  ['apple']
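
Character sets also support ranges (with a hyphen) and negation (with a leading ^ inside the brackets). The sketch below uses made-up sample text to show both.

```python
import re

text = "Room A1, room b2, ROOM c3"

lower_only = re.findall(r"[a-c]\d", text)       # lowercase a-c, then a digit
either_case = re.findall(r"[A-Ca-c]\d", text)   # two ranges in one set
not_listed = re.findall(r"[^a-zA-Z\s,]", text)  # anything except letters, whitespace, commas

print(lower_only)   # ['b2', 'c3']
print(either_case)  # ['A1', 'b2', 'c3']
print(not_listed)   # ['1', '2', '3']
```

Note that ^ means "start of string" outside brackets but "not these characters" when it is the first character inside them.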


Escape sequences (shorthand character classes) make pattern matching easier:

 \d (Digit):

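Description --> Matches any decimal digit. For ASCII text that means 0 to 9; with str patterns, Python also matches digits from other scripts unless the re.ASCII flag is set.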

In [9]:
import re

text = "There are 3 apples and 5 oranges."
pattern = r"\d"  # Matches any single digit

matches = re.findall(pattern, text)
print("Matched String : ", matches) 

Matched String :  ['3', '5']


 \D (Non-Digit):

Description --> Matches any character that is not a digit (0-9).

In [10]:
import re

text = "Room 101 is on the 1st floor."
pattern = r"\D"  # Matches any non-digit character

matches = re.findall(pattern, text)
print("Matched String : ", matches) 


Matched String :  ['R', 'o', 'o', 'm', ' ', ' ', 'i', 's', ' ', 'o', 'n', ' ', 't', 'h', 'e', ' ', 's', 't', ' ', 'f', 'l', 'o', 'o', 'r', '.']


 \w (Word Character):

Description --> Matches any word character, which includes letters (a-z, A-Z), digits (0-9), and underscores (_).

In [11]:
import re

text = "Hello_World123!"
pattern = r"\w"  # Matches any word character

matches = re.findall(pattern, text)
print("Matched String : ", matches) 

Matched String :  ['H', 'e', 'l', 'l', 'o', '_', 'W', 'o', 'r', 'l', 'd', '1', '2', '3']
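
In Python 3, \w and \d are Unicode-aware for str patterns, so they match more than the ASCII ranges. Passing re.ASCII restricts them. This is a small sketch with made-up sample strings, written with escapes so the characters are unambiguous.

```python
import re

sample = "\u00e9_1"  # 'é', underscore, digit 1
digits = "7\u0969"   # ASCII 7 and DEVANAGARI DIGIT THREE

unicode_words = re.findall(r"\w", sample)
ascii_words = re.findall(r"\w", sample, re.ASCII)
unicode_digits = re.findall(r"\d", digits)
ascii_digits = re.findall(r"\d", digits, re.ASCII)

print(unicode_words)        # ['é', '_', '1']
print(ascii_words)          # ['_', '1']
print(len(unicode_digits))  # 2
print(ascii_digits)         # ['7']
```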


\W (Non-Word Character):

Description --> Matches any character that is not a word character (anything except a-z, A-Z, 0-9, and _).

In [12]:
import re

text = "Hello, World! 123."
pattern = r"\W"  # Matches any non-word character

matches = re.findall(pattern, text)
print("Matched String : ", matches) 

Matched String :  [',', ' ', '!', ' ', '.']


\s (Whitespace):

Description --> Matches any whitespace character, including spaces, tabs, and newlines.

In [13]:
import re

text = "Hello,\nWorld!  This is a test."
pattern = r"\s"  # Matches any whitespace character

matches = re.findall(pattern, text)
print("Matched String : ", matches) 

Matched String :  ['\n', ' ', ' ', ' ', ' ', ' ']


 \S (Non-Whitespace):

Description --> Matches any character that is not a whitespace character.

In [14]:
import re

text = "Hello, World! 123."
pattern = r"\S"  # Matches any non-whitespace character

matches = re.findall(pattern, text)
print("Matched String : ", matches) 

Matched String :  ['H', 'e', 'l', 'l', 'o', ',', 'W', 'o', 'r', 'l', 'd', '!', '1', '2', '3', '.']
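
Matching one character at a time is rarely what you want in practice. As a sketch, combining \S with the + quantifier (covered in the Quantifiers section) pulls out whole chunks of non-whitespace, which is a simple way to normalize messy spacing. The sample text is made up.

```python
import re

text = "Hello,\nWorld!  This is\ta test."

# Each run of non-whitespace characters becomes one token.
tokens = re.findall(r"\S+", text)
normalized = " ".join(tokens)

print(tokens)      # ['Hello,', 'World!', 'This', 'is', 'a', 'test.']
print(normalized)  # Hello, World! This is a test.
```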


Regex Operations in Python (re Module)

- findall()
- search()
- match()

findall():

Description --> The findall() function returns a list of all non-overlapping matches of a pattern in a string. It scans the entire string and finds **all occurrences** that match the specified regex pattern.

In [15]:
import re

text = "The rain in Spain falls mainly in the plain."
pattern = r"ain"  # Pattern to search for

matches = re.findall(pattern, text)
print("Matched String : ", matches) 

Matched String :  ['ain', 'ain', 'ain', 'ain']
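
findall() only returns the matched text. If you also need positions, re.finditer() yields match objects instead. The sketch below reuses the same sentence and also shows a slightly wider pattern that grabs the whole word around each "ain".

```python
import re

text = "The rain in Spain falls mainly in the plain."

# finditer() yields match objects, so the span of each hit is available.
spans = [(m.group(), m.span()) for m in re.finditer(r"ain", text)]
print(spans)
# [('ain', (5, 8)), ('ain', (14, 17)), ('ain', (25, 28)), ('ain', (40, 43))]

# \w* on both sides extends each match to the surrounding word.
words = re.findall(r"\w*ain\w*", text)
print(words)  # ['rain', 'Spain', 'mainly', 'plain']
```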


search():

Description --> The search() function scans through a string and returns the **first match of the pattern**. If a match is found, it returns a match object; otherwise, it returns None.

In [16]:
import re

text = "The rain in Spain falls mainly in the plain."
pattern = r"Spain"  # Pattern to search for

match = re.search(pattern, text)
if match:
    print("Match found:", match.group())
else:
    print("No match found.")

Match found: Spain


match():

Description --> The match() function checks for a match only at the **beginning of the string**. It returns a match object if the pattern matches the start of the string; otherwise, it returns None.

In [17]:
import re

text = "The rain in Spain falls mainly in the plain."
pattern = r"The"  # Pattern to search for

match = re.match(pattern, text)
if match:
    print("Match found:", match.group())
else:
    print("No match found.")

Match found: The
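
To make the difference concrete, here is a side-by-side sketch of match(), search(), and re.fullmatch() (which requires the entire string to match) on one made-up sentence.

```python
import re

text = "The rain in Spain"

at_start = re.match(r"rain", text)        # 'rain' is not at the start
anywhere = re.search(r"rain", text)       # found further into the string
whole_short = re.fullmatch(r"The", text)  # pattern covers only part of the string
whole_exact = re.fullmatch(r"The rain in Spain", text)

print(at_start)             # None
print(anywhere.span())      # (4, 8)
print(whole_short)          # None
print(whole_exact.group())  # The rain in Spain
```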


Quantifiers: Controlling Repetitions

- \+ 
- \* 
- \? 
- \{n,m}

 \+ (One or More):

Description --> The **'+'** quantifier matches **one or more occurrences** of the preceding element. It requires at least one match to be present.

In [18]:
import re

text = "a aa aaa aaaa"
pattern = r"a+"  # Matches one or more 'a's

matches = re.findall(pattern, text)
print("Matched String : ", matches) 

Matched String :  ['a', 'aa', 'aaa', 'aaaa']


\* (Zero or More):

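Description --> The **'*'** quantifier matches **zero or more occurrences** of the preceding element. It can match an empty string as well, which is why the output below contains empty strings at the positions where no 'a' is found (each space and the end of the string).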

In [19]:
import re

text = "a aa aaa aaaa"
pattern = r"a*"  # Matches zero or more 'a's

matches = re.findall(pattern, text)
print("Matched String : ", matches) 

Matched String :  ['a', '', 'aa', '', 'aaa', '', 'aaaa', '']


? (Zero or One):

Description --> The **'?'** quantifier matches **zero or one occurrence** of the preceding element. **It makes the preceding element optional**.

In [20]:
import re

text = "a aa aaa aaaa"
pattern = r"a?"  # Matches zero or one 'a'

matches = re.findall(pattern, text)
print("Matched String : ", matches)

Matched String :  ['a', '', 'a', 'a', '', 'a', 'a', 'a', '', 'a', 'a', 'a', 'a', '']


\{n,m\} (Min, Max):

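Description --> The **'{n,m}'** quantifier matches **between n and m occurrences** of the preceding element. **It requires at least n matches and allows up to m matches**. When more than m occurrences are available, the match stops at m and the search continues from there, so a run of five 'a's yields 'aaaa' and leaves a single 'a' that is too short to match.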

In [21]:
import re

text = "a aa aaa aaaa aaaaa"
pattern = r"a{2,4}"  # Matches between 2 and 4 'a's

matches = re.findall(pattern, text)
print("Matched String : ", matches)

Matched String :  ['aa', 'aaa', 'aaaa', 'aaaa']
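
Quantifiers become more useful when combined with the escape sequences from earlier. The sketch below uses made-up sample text and shows a few common variations, including the exact-count form {n} and the open-ended form {n,}.

```python
import re

text = "PIN 1234, code 12, id 123456"

all_numbers = re.findall(r"\d+", text)        # one or more digits
four_digits = re.findall(r"\b\d{4}\b", text)  # exactly four digits, as a whole word
three_plus = re.findall(r"\d{3,}", text)      # three or more digits

# '?' makes the 'u' optional, so both spellings match.
spellings = re.findall(r"colou?r", "color colour colouur")

print(all_numbers)  # ['1234', '12', '123456']
print(four_digits)  # ['1234']
print(three_plus)   # ['1234', '123456']
print(spellings)    # ['color', 'colour']
```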


Grouping & Capturing: The Power of ()

Parentheses () allow you to group parts of your pattern and capture matched text for later use. This is useful when extracting specific parts of a string.

In [22]:
import re

text = "Order number: 12345, Date: 2023-10-01"
pattern = r"Order number: (\d+), Date: (\d{4}-\d{2}-\d{2})"  # Captures order number and date

match = re.search(pattern, text)
if match:
    order_number = match.group(1)  # First captured group
    order_date = match.group(2)     # Second captured group
    print(f"Order Number: {order_number}, Order Date: {order_date}")

Order Number: 12345, Order Date: 2023-10-01
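
Groups can also be given names with the (?P<name>...) syntax, which tends to be easier to read than numbered groups. The sketch below reuses the order text from above; the group names and the key=value sample string are my own choices for illustration.

```python
import re

text = "Order number: 12345, Date: 2023-10-01"
pattern = (
    r"Order number: (?P<order>\d+), "
    r"Date: (?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
)

match = re.search(pattern, text)
if match:
    print(match.group("order"))  # 12345
    print(match.groupdict())
    # {'order': '12345', 'year': '2023', 'month': '10', 'day': '01'}

# When a pattern has groups, findall() returns tuples of the group contents.
pairs = re.findall(r"(\w+)=(\d+)", "a=1, b=22, c=x")
print(pairs)  # [('a', '1'), ('b', '22')]
```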


Greedy vs. Lazy Matching

Greedy Matchers

Description --> Greedy matchers **try to match as much text as possible**. The quantifiers *, +, ?, and {n,m} are all greedy by default: they expand their match as far as they can while still allowing the overall pattern to succeed.

In [23]:
import re

text = "abc ac abced"
pattern = r"a.*c"  # Greedy match

match = re.search(pattern, text)
print(match.group())

abc ac abc


Lazy Matchers

Description --> Lazy matchers (also known as non-greedy matchers) **try to match as little text as possible**. Adding a ? after a quantifier makes it lazy: it stops matching as soon as it can while still allowing the overall pattern to succeed.

*? (Zero or More, Lazy)

+? (One or More, Lazy)

In [24]:
import re

text = "abc ac abced"
pattern = r"a.*?c"  # Lazy match

match = re.search(pattern, text)
print(match.group())

abc
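
A classic place where greedy vs. lazy matters is pulling tags out of markup. This is only a sketch on a made-up snippet (regex is not a full HTML parser), but it shows the difference clearly.

```python
import re

html = "<b>bold</b> and <i>italic</i>"  # hypothetical snippet

greedy = re.findall(r"<.+>", html)      # runs to the last '>'
lazy = re.findall(r"<.+?>", html)       # stops at the first '>'
negated = re.findall(r"<[^>]+>", html)  # alternative: exclude '>' explicitly

print(greedy)   # ['<b>bold</b> and <i>italic</i>']
print(lazy)     # ['<b>', '</b>', '<i>', '</i>']
print(negated)  # ['<b>', '</b>', '<i>', '</i>']
```

The negated character set gives the same result as the lazy version here, and some people find it easier to reason about.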


Fun Fact: Where Did Regex Come From?

Did you know Regex dates back to the 1950s? It was introduced by mathematician Stephen Kleene as part of automata theory. Today, it powers everything from search engines to AI models!