# Regex

A regex (regular expression) is a sequence of characters that defines a search pattern. In NLP, it is used to:

- Clean text
- Extract patterns (emails, dates, hashtags)
- Find or replace specific patterns

## Import re Library
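
Every example in this notebook relies on Python's standard-library `re` module. As a rough sketch of the three uses listed above, the cell below extracts, finds, and cleans with `re.findall`, `re.search`, and `re.sub`. The sample sentence and variable names are made up for illustration.

```python
import re

# Hypothetical sample text, made up for illustration.
sample = "Loving   #NLP and #regex!!  Mail me: me@example.com"

# Extract patterns: a hashtag is '#' followed by word characters.
hashtags = re.findall(r"#\w+", sample)

# Find a specific pattern: re.search returns a match object or None.
first_email = re.search(r"\S+@\S+", sample)

# Clean text: collapse each run of whitespace into a single space.
cleaned = re.sub(r"\s+", " ", sample)

print(hashtags)             # ['#NLP', '#regex']
print(first_email.group())  # me@example.com
print(cleaned)              # Loving #NLP and #regex!! Mail me: me@example.com
```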


## 1. Character Classes

**. → Any character except newline**

In [2]:
import re
re.findall(r'c.t', 'cat cut cot c t')

['cat', 'cut', 'cot', 'c t']

**\w \d \s → Word, digit, whitespace**

In [3]:
text = """
General text7 with multiple words:

This is a simple text string.

Test string with numbers 12345 and symbols !@#$

Regex can be tricky at times, but it's powerful.

Look out for 2025, it's going to be an exciting year!

Call me at 555-123-4567 for more info.
"""

re.findall(r"\d+", text)

['7', '12345', '2025', '555', '123', '4567']

In [None]:
re.findall(r'\w+', 'Regex101 is #1!')
# ['Regex101', 'is', '1']

re.findall(r'\d+', 'Year: 2025')
# ['2025']

re.findall(r'\s+', 'a   b')
# ['   ']

**\W \D \S → Not word, digit, whitespace**

In [None]:
re.findall(r'\W+', 'A&B*C!')
# ['&', '*', '!']

re.findall(r'\D+', 'Call 911 now!')
# ['Call ', ' now!']

re.findall(r'\S+', 'text   with   spaces')
# ['text', 'with', 'spaces']

**[abc] → Any of a, b, or c**


In [8]:
re.findall(r'[abc]', 'Apple banana carrot', re.I)

['A', 'b', 'a', 'a', 'a', 'c', 'a']

**[^abc] → Not a, b, or c**

**[a-g] → Character between a and g**

In [None]:
re.findall(r'[a-g]', 'abcdefgzxy')
# ['a', 'b', 'c', 'd', 'e', 'f', 'g']
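
The negated class has no example of its own, so here is a minimal sketch. The test strings and variable names are chosen only for illustration.

```python
import re

# [^abc] matches any single character that is NOT a, b, or c.
not_abc = re.findall(r"[^abc]", "abcxyz")

# A range can be negated as well: anything outside a through g.
not_a_to_g = re.findall(r"[^a-g]", "abcdefgzxy")

print(not_abc)     # ['x', 'y', 'z']
print(not_a_to_g)  # ['z', 'x', 'y']
```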

## 2. Anchors

**^abc$ → Start & end of string**

In [4]:
re.match(r'^abc$', 'abc')  # full string is "abc"
# Match object exists

re.match(r'^abc$', 'abcd')  # does not match full string
# None
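
By default `^` and `$` anchor to the whole string, which matters once the text spans several lines. The sketch below, using a made-up three-line string, compares the default behaviour with `re.MULTILINE` and shows `re.fullmatch` as an anchor-free alternative.

```python
import re

# Hypothetical multi-line input for illustration.
lines = "abc\nabcd\nabc"

# Default: ^ only matches at the very start, $ only at the very end.
whole = re.findall(r"^abc$", lines)

# re.MULTILINE: ^ and $ also match at the start and end of each line.
per_line = re.findall(r"^abc$", lines, re.MULTILINE)

# re.fullmatch requires the entire string to match, without anchors.
full_ok = re.fullmatch(r"abc", "abc")
full_no = re.fullmatch(r"abc", "abcd")

print(whole)     # []
print(per_line)  # ['abc', 'abc']
print(full_ok is not None, full_no is None)  # True True
```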


**\b and \B → Word boundary and not word boundary**

In [7]:
re.findall(r'\bcat\b', 'cat catalog category')
# ['cat']

re.findall(r'\Bcat\B', 'educate location')


['cat', 'cat']

## 3. Escaped Characters

**\. \* \\ → Escape regex symbols**

In [8]:
re.findall(r'\.', 'test.example.com')
# ['.', '.']

re.findall(r'\*', 'a * b * c')

['*', '*']
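
The heading also lists `\\`, which the cell above does not exercise. A small sketch follows, with a made-up Windows-style path as input. It also shows `re.escape`, which builds a literal pattern from arbitrary text so the symbols do not have to be escaped by hand.

```python
import re

# r'\\' is the regex for one literal backslash.
backslashes = re.findall(r"\\", r"C:\temp\file")

# re.escape returns a pattern that matches the given text literally.
literal = re.escape("a.b*c")
found = re.findall(literal, "a.b*c axb*c")

print(len(backslashes))  # 2
print(found)             # ['a.b*c']
```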

**\t \n \r → Tabs, newlines, carriage returns**

In [11]:
text = 'first\tsecond\nthird\rfourth'
re.findall(r'\t|\n|\r', text)


['\t', '\n', '\r']

## 4. Groups & Lookaround

**(abc) → Capture group**

In [1]:
import re

txt = """Serial numbers: A1234B, A5678C, Z4321X.
Account number: 987654321 Name: Gagan Surname: Puri
Account number: 78969873"""

# Each alternative has its own group, so every match fills one slot and leaves the other empty.
re.findall(r"Name: (\w+)| Surname: (\w+)", txt)

[('Gagan', ''), ('', 'Puri')]
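
If both fields should land in one tuple, one option is to match them in a single pattern instead of using alternation. The sketch below assumes the two fields always appear in this order on the same line. The group names `first` and `last` are picked here for illustration.

```python
import re

txt = "Account number: 987654321 Name: Gagan Surname: Puri"

# One pattern with two groups: findall returns one tuple per match.
pairs = re.findall(r"Name: (\w+) Surname: (\w+)", txt)

# Named groups (illustrative names) can be read back by name.
m = re.search(r"Name: (?P<first>\w+) Surname: (?P<last>\w+)", txt)

print(pairs)             # [('Gagan', 'Puri')]
print(m.group("first"))  # Gagan
print(m.group("last"))   # Puri
```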

In [5]:
re.findall(r'(ha)+', 'hahaha')

['ha']
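
The result is `['ha']` rather than `['hahaha']` because a repeated capture group keeps only its last repetition. One way to capture the whole run, sketched here, is to wrap the repetition in an outer group and make the inner group non-capturing with `(?:...)`.

```python
import re

# A repeated group keeps only its final repetition.
last_only = re.findall(r"(ha)+", "hahaha")

# The outer group captures the full run; the inner one does not capture.
whole_run = re.findall(r"((?:ha)+)", "hahaha")

print(last_only)  # ['ha']
print(whole_run)  # ['hahaha']
```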

**\1 → Backreference**

In [14]:
re.findall(r'(\w+)\s+\1', 'hello hello world world')

['hello', 'world']
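
Backreferences also work in the replacement string of `re.sub`, which ties back to the "find or replace" use case. The sketch below removes doubled words. The `\b` boundaries are an addition not present in the cell above, there to keep the backreference from matching inside a longer word.

```python
import re

text = "hello hello world world"

# \1 in the pattern re-matches group 1; r"\1" in the replacement keeps one copy.
deduped = re.sub(r"\b(\w+)\s+\1\b", r"\1", text)

print(deduped)  # hello world
```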

**(?:abc) → Non-capturing group**

In [4]:
txt = """Serial numbers: A1234B, A5678C, Z4321X.
Account number: 987654321 Name: Gagan Surname: -=-=-
Account number: 78969873"""

re.findall(r"(?:Account number: )\d+", txt)

['Account number: 987654321', 'Account number: 78969873']

In [15]:
re.findall(r'(?:ha)+(ga)', 'hahahaga')

['ga']

**(?=abc) → Positive lookahead**

In [16]:
re.findall(r'\w+(?=ing)', 'eating running played')

['eat', 'runn']
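
Python's `re` supports three other lookarounds besides positive lookahead: negative lookahead `(?!...)`, positive lookbehind `(?<=...)`, and negative lookbehind `(?<!...)`. The cell below is a minimal sketch of each, using made-up strings.

```python
import re

# Negative lookahead: 'foo' only when NOT followed by 'bar'.
neg_ahead = re.findall(r"foo(?!bar)", "foobar foobaz")

# Positive lookbehind: digits that come right after a '$'.
pos_behind = re.findall(r"(?<=\$)\d+", "cost $45, id 99")

# Negative lookbehind: digits NOT preceded by '$' or by another digit.
neg_behind = re.findall(r"(?<![\$\d])\d+", "cost $45, id 99")

print(neg_ahead)   # ['foo']
print(pos_behind)  # ['45']
print(neg_behind)  # ['99']
```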

In [None]:
re.findall(r'[a-z_]+\d*@[a-z]+\.[a-z]{2,}', 'rupa_shrestha@gmail.com')
# ['rupa_shrestha@gmail.com']

In [None]:
# Practice: try this date pattern on the sample text below.
# \d{2,4}[\/\-]\d{2}[\/\-]\d{2,4}

# John's phone number is (123) 456-7890 and his office line is 123-456-7890. 
# He was born on 1990-12-31 and got his license on 31/12/1990 or 31-12-1990. 
# Send an email to john.doe@example.com or support@company.co.uk. 
# Visit our website at https://www.example.com or follow http://blog.example.com/articles?id=123.
 
# His IP address is 192.168.1.1 and his backup server is at 10.0.0.254.
# The color theme used was #FF5733, #00ffcc, and #000.
 
# He paid $123.45 for the order using his credit card 4111-1111-1111-1111, expiring 12/25.
# Zip code is 90210 and his alternate is 10001-0001.
 
# The meeting is scheduled for 09:30 AM and may go until 17:45.
# The config file is at C:\Users\JohnDoe\config.txt or /etc/config/settings.json.
 
# On social media, he goes by @johnny and tags his posts with #regexFun and #coding.
# The order total was €99.99, while another invoice shows £100.00 and ₹7500.
 
# Tracking ID: AB123456789CD and customer ID: CUST-00123-XYZ.
# His blog uses <div class="post">HTML elements</div> frequently.
 
# Device UUID: 550e8400-e29b-41d4-a716-446655440000.
# His SSN is 123-45-6789 but should be kept private.
 
# Notes: Some random strings to test:
# abc123, X9Z8W7, 42-42-42, this_is-a_test, ends.with.dot.
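
As a starting point for the practice text, the sketch below runs the date pattern from the top of this cell on two short excerpts. The character class is written as `[/-]`, which is equivalent to `[\/\-]`. The variable names are illustrative.

```python
import re

# Same date pattern as in the comment above, with the redundant escapes dropped.
date_pattern = r"\d{2,4}[/-]\d{2}[/-]\d{2,4}"

excerpt = "He was born on 1990-12-31 and got his license on 31/12/1990 or 31-12-1990."
dates = re.findall(date_pattern, excerpt)
print(dates)  # ['1990-12-31', '31/12/1990', '31-12-1990']

# Caveat: the pattern only checks digit counts, so an SSN also matches.
false_positive = re.findall(date_pattern, "His SSN is 123-45-6789")
print(false_positive)  # ['123-45-6789']
```

The pattern is loose enough that it would likely need tightening, for example with word boundaries or stricter digit counts, before being used on the full text.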