# Understanding Regex

# Import the 're' Module

In [1]:
import re

# Sample Text

In [4]:
text = """Regular Expressions (Regex) is a powerful tool in NLP for pattern matching and text processing.
It allows us to define search patterns using special characters and rules to find, extract, or replace specific text within large datasets. Regex is widely used for text cleaning, data validation, entity extraction, and preprocessing tasks. For example, it can be used to extract email addresses, phone numbers, dates, and specific keywords from raw text.
Key components of regex include character classes (e.g., [0-9] for digits), quantifiers (*, +, ? for repetitions), and metacharacters (\d for digits, \w for words, \s for spaces). While regex is incredibly efficient for rule-based text manipulation, complex patterns can be tricky to manage, making it important to balance flexibility and accuracy when applying it in NLP tasks."""

# Basic Regex Operations

Regex allows you to perform various operations on text, such as searching for patterns, replacing text, and splitting strings.

### Searching for a Pattern

In [6]:
pattern = r"regex"  # Case-sensitive: matches "regex" but not "Regex"
matches = re.findall(pattern, text)
print("Matches:", matches)

Matches: ['regex', 'regex']
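
The search above finds only the lowercase occurrences because matching is case-sensitive by default. A minimal sketch of a case-insensitive search follows; the `sample` string is a shortened stand-in for the notebook's `text`, introduced here only for illustration.

```python
import re

# Shortened stand-in for the notebook's sample text (an assumption).
sample = "Regex is widely used. Key components of regex include character classes."

# Default matching is case-sensitive, so "Regex" is skipped.
case_sensitive = re.findall(r"regex", sample)

# re.IGNORECASE makes the pattern match regardless of letter case.
case_insensitive = re.findall(r"regex", sample, flags=re.IGNORECASE)

print("Case-sensitive:", case_sensitive)      # ['regex']
print("Case-insensitive:", case_insensitive)  # ['Regex', 'regex']
```

Passing the flag keeps the pattern itself readable; the inline form `(?i)regex` should behave the same way.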


### Replacing Text

In [7]:
# Note: "fox" does not occur in the sample text, so re.sub returns the text unchanged.
pattern = r"fox"
replacement = "cat"
new_text = re.sub(pattern, replacement, text)
print("New Text:", new_text)

New Text: Regular Expressions (Regex) is a powerful tool in NLP for pattern matching and text processing. 
It allows us to define search patterns using special characters and rules to find, extract, or replace specific text within large datasets. Regex is widely used for text cleaning, data validation, entity extraction, and preprocessing tasks. For example, it can be used to extract email addresses, phone numbers, dates, and specific keywords from raw text. 
Key components of regex include character classes (e.g., [0-9] for digits), quantifiers (*, +, ? for repetitions), and metacharacters (\d for digits, \w for words, \s for spaces). While regex is incredibly efficient for rule-based text manipulation, complex patterns can be tricky to manage, making it important to balance flexibility and accuracy when applying it in NLP tasks.
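
Since `fox` never appears in the sample text, the output above is identical to the input. The sketch below uses a pattern that does match, so the effect of a substitution is visible. The `sample` string and the replacement wording are assumptions chosen for illustration, not part of the original notebook.

```python
import re

# Shortened stand-in for the notebook's sample text (an assumption).
sample = "Regex is widely used in NLP. Applying regex in NLP tasks takes care."

# re.subn works like re.sub but also returns how many replacements were made,
# which makes a silent "no match" easy to detect.
new_sample, count = re.subn(r"NLP", "natural language processing", sample)

print("New Text:", new_sample)
print("Replacements:", count)  # 2
```

Checking `count` is a cheap way to confirm that a pattern matched at all.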


### Splitting Strings

In [8]:
pattern = r"\s"  # Split on any single whitespace character (space, tab, or newline)
split_text = re.split(pattern, text)
print("Split Text:", split_text)

Split Text: ['Regular', 'Expressions', '(Regex)', 'is', 'a', 'powerful', 'tool', 'in', 'NLP', 'for', 'pattern', 'matching', 'and', 'text', 'processing.', '', 'It', 'allows', 'us', 'to', 'define', 'search', 'patterns', 'using', 'special', 'characters', 'and', 'rules', 'to', 'find,', 'extract,', 'or', 'replace', 'specific', 'text', 'within', 'large', 'datasets.', 'Regex', 'is', 'widely', 'used', 'for', 'text', 'cleaning,', 'data', 'validation,', 'entity', 'extraction,', 'and', 'preprocessing', 'tasks.', 'For', 'example,', 'it', 'can', 'be', 'used', 'to', 'extract', 'email', 'addresses,', 'phone', 'numbers,', 'dates,', 'and', 'specific', 'keywords', 'from', 'raw', 'text.', '', 'Key', 'components', 'of', 'regex', 'include', 'character', 'classes', '(e.g.,', '[0-9]', 'for', 'digits),', 'quantifiers', '(*,', '+,', '?', 'for', 'repetitions),', 'and', 'metacharacters', '(\\d', 'for', 'digits,', '\\w', 'for', 'words,', '\\s', 'for', 'spaces).', 'While', 'regex', 'is', 'incredibly', 'efficient',
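
The empty strings (`''`) in the output above appear wherever two whitespace characters are adjacent, such as a space followed by a newline. One possible way to avoid them, shown on a shortened stand-in string (an assumption), is to split on runs of whitespace with `\s+`.

```python
import re

# Stand-in snippet with a space followed by a newline, as in the sample text (an assumption).
sample = "pattern matching and text processing. \nIt allows us to define search patterns."

# Splitting on a single whitespace character leaves an empty string
# between the space and the newline.
single = re.split(r"\s", sample)

# Splitting on one or more whitespace characters collapses the run.
collapsed = re.split(r"\s+", sample)

print("Single:", single)
print("Collapsed:", collapsed)
```

For plain whitespace splitting, `sample.split()` with no arguments gives the same result here without needing a regex.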

# Character Classes and Quantifiers

### Character Classes

In [9]:
pattern = r"[aeiou]"  # Matches any single lowercase vowel
vowels = re.findall(pattern, text)
print("Vowels:", vowels)

Vowels: ['e', 'u', 'a', 'e', 'i', 'o', 'e', 'e', 'i', 'a', 'o', 'e', 'u', 'o', 'o', 'i', 'o', 'a', 'e', 'a', 'i', 'a', 'e', 'o', 'e', 'i', 'a', 'o', 'u', 'o', 'e', 'i', 'e', 'e', 'a', 'a', 'e', 'u', 'i', 'e', 'i', 'a', 'a', 'a', 'e', 'a', 'u', 'e', 'o', 'i', 'e', 'a', 'o', 'e', 'a', 'e', 'e', 'i', 'i', 'e', 'i', 'i', 'a', 'e', 'a', 'a', 'e', 'e', 'e', 'i', 'i', 'e', 'u', 'e', 'o', 'e', 'e', 'a', 'i', 'a', 'a', 'a', 'i', 'a', 'i', 'o', 'e', 'i', 'e', 'a', 'i', 'o', 'a', 'e', 'o', 'e', 'i', 'a', 'o', 'e', 'a', 'e', 'i', 'a', 'e', 'u', 'e', 'o', 'e', 'a', 'e', 'a', 'i', 'a', 'e', 'e', 'o', 'e', 'u', 'e', 'a', 'e', 'a', 'e', 'i', 'i', 'e', 'o', 'o', 'a', 'e', 'e', 'o', 'o', 'e', 'o', 'e', 'e', 'i', 'u', 'e', 'a', 'a', 'e', 'a', 'e', 'e', 'o', 'i', 'i', 'u', 'a', 'i', 'i', 'e', 'o', 'e', 'e', 'i', 'i', 'o', 'a', 'e', 'a', 'a', 'a', 'e', 'o', 'i', 'i', 'o', 'o', 'o', 'a', 'e', 'i', 'e', 'e', 'e', 'i', 'i', 'e', 'i', 'e', 'i', 'i', 'e', 'o', 'u', 'e', 'a', 'e', 'e', 'a', 'i', 'u', 'a', 'i', '
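
Character classes can also hold ranges and negations. The following sketch illustrates a few common forms on a short made-up string; the `sample` value is purely hypothetical.

```python
import re

# Hypothetical sample string for illustration.
sample = "Room 42, Floor 7: NLP lab"

digits = re.findall(r"[0-9]", sample)               # any single digit
uppercase = re.findall(r"[A-Z]", sample)            # any uppercase ASCII letter
vowels = re.findall(r"[aeiouAEIOU]", sample)        # vowels in either case
symbols = re.findall(r"[^a-zA-Z0-9\s]", sample)     # ^ negates: not a letter, digit, or whitespace

print("Digits:", digits)        # ['4', '2', '7']
print("Uppercase:", uppercase)  # ['R', 'F', 'N', 'L', 'P']
print("Vowels:", vowels)        # ['o', 'o', 'o', 'o', 'a']
print("Symbols:", symbols)      # [',', ':']
```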

### Quantifiers

In [10]:
pattern = r"o{2}"  # Matches two consecutive 'o' characters
matches = re.findall(pattern, text)
print("Matches:", matches)

Matches: ['oo']
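
Besides the exact count `{2}`, the quantifiers `*`, `+`, `?`, and `{m,n}` mentioned in the sample text control how many repetitions are allowed. A small sketch on a made-up string (an assumption) shows how they differ.

```python
import re

# Hypothetical sample: "c" and "t" separated by zero to three "a" characters.
sample = "ct cat caat caaat"

zero_or_more = re.findall(r"ca*t", sample)    # * : zero or more 'a'
one_or_more = re.findall(r"ca+t", sample)     # + : one or more 'a'
zero_or_one = re.findall(r"ca?t", sample)     # ? : zero or one 'a'
two_to_three = re.findall(r"ca{2,3}t", sample)  # {2,3} : two or three 'a'

print("ca*t:", zero_or_more)       # ['ct', 'cat', 'caat', 'caaat']
print("ca+t:", one_or_more)        # ['cat', 'caat', 'caaat']
print("ca?t:", zero_or_one)        # ['ct', 'cat']
print("ca{2,3}t:", two_to_three)   # ['caat', 'caaat']
```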


# Metacharacters and Anchors

### Metacharacters

In [11]:
pattern = r"\b\w{3}\b"  # Matches three-character words (\w covers letters, digits, and underscore)
matches = re.findall(pattern, text)
print("Matches:", matches)

Matches: ['NLP', 'for', 'and', 'and', 'for', 'and', 'For', 'can', 'and', 'raw', 'Key', 'for', 'for', 'and', 'for', 'for', 'for', 'for', 'can', 'and', 'NLP']
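
Because `\w` also matches digits and underscores, `\b\w{3}\b` is not strictly a "three-letter word" pattern. The sketch below contrasts it with a letters-only class and shows `\d` and `\s` on a short hypothetical string.

```python
import re

# Hypothetical sample string for illustration.
sample = "The cat has 123 toys"

numbers = re.findall(r"\d+", sample)                      # \d : digits
three_chars = re.findall(r"\b\w{3}\b", sample)            # \w : letters, digits, underscore
three_letters = re.findall(r"\b[A-Za-z]{3}\b", sample)    # letters only
whitespace_count = len(re.findall(r"\s", sample))         # \s : whitespace

print("Numbers:", numbers)              # ['123']
print("Three chars:", three_chars)      # ['The', 'cat', 'has', '123']
print("Three letters:", three_letters)  # ['The', 'cat', 'has']
print("Whitespace count:", whitespace_count)  # 4
```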


### Anchors

In [13]:
pattern = r"^Regular"  # ^ anchors the match to the start of the string
matches = re.findall(pattern, text)
print("Matches:", matches)

Matches: ['Regular']
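
By default `^` and `$` match only at the start and end of the whole string. With `re.MULTILINE` they also match at the start and end of each line, which may be useful for multi-line text like the sample. The `sample` below is a shortened, hypothetical stand-in.

```python
import re

# Hypothetical three-line sample for illustration.
sample = "Regular Expressions are powerful.\nIt allows pattern matching.\nKey components include anchors."

# Without MULTILINE, ^ matches only at the very start of the string.
first_word = re.findall(r"^\w+", sample)

# With MULTILINE, ^ matches at the start of every line.
line_starts = re.findall(r"^\w+", sample, flags=re.MULTILINE)

# Likewise, $ matches at the end of every line under MULTILINE.
line_ends = re.findall(r"\w+\.$", sample, flags=re.MULTILINE)

print("First word:", first_word)    # ['Regular']
print("Line starts:", line_starts)  # ['Regular', 'It', 'Key']
print("Line ends:", line_ends)      # ['powerful.', 'matching.', 'anchors.']
```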


# Grouping and Capturing

In [14]:
pattern = r"(\w+)\s(\w+)"  # Captures non-overlapping pairs of words separated by one whitespace character
matches = re.findall(pattern, text)
print("Matches:", matches)

Matches: [('Regular', 'Expressions'), ('is', 'a'), ('powerful', 'tool'), ('in', 'NLP'), ('for', 'pattern'), ('matching', 'and'), ('text', 'processing'), ('It', 'allows'), ('us', 'to'), ('define', 'search'), ('patterns', 'using'), ('special', 'characters'), ('and', 'rules'), ('to', 'find'), ('or', 'replace'), ('specific', 'text'), ('within', 'large'), ('Regex', 'is'), ('widely', 'used'), ('for', 'text'), ('data', 'validation'), ('entity', 'extraction'), ('and', 'preprocessing'), ('For', 'example'), ('it', 'can'), ('be', 'used'), ('to', 'extract'), ('email', 'addresses'), ('phone', 'numbers'), ('and', 'specific'), ('keywords', 'from'), ('raw', 'text'), ('Key', 'components'), ('of', 'regex'), ('include', 'character'), ('for', 'digits'), ('for', 'repetitions'), ('and', 'metacharacters'), ('d', 'for'), ('w', 'for'), ('s', 'for'), ('While', 'regex'), ('is', 'incredibly'), ('efficient', 'for'), ('based', 'text'), ('complex', 'patterns'), ('can', 'be'), ('tricky', 'to'), ('making', 'it'), ('im
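
Groups can also be named, which tends to make extraction code easier to read. The sketch below pulls a user and domain out of email-like strings, one of the use cases the sample text mentions. The pattern is deliberately simplified and is not a full email validator; the addresses and group names are hypothetical.

```python
import re

# Hypothetical sample with placeholder addresses.
sample = "Contact: alice@example.com, bob@example.org"

# (?P<name>...) defines a named capturing group. Simplified pattern, not RFC-compliant.
pattern = r"(?P<user>[\w.]+)@(?P<domain>[\w.]+)"

# finditer yields match objects, so groups can be read by name.
contacts = [(m.group("user"), m.group("domain")) for m in re.finditer(pattern, sample)]

print("Contacts:", contacts)  # [('alice', 'example.com'), ('bob', 'example.org')]
```

With named groups, `re.findall` still returns plain tuples, so `re.finditer` or `re.search` is the way to access groups by name.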

# Using Regex for Tokenization

Regex is also useful for tokenizing text into words or sentences.

### Tokenizing into Words

In [15]:
pattern = r"\s"  # Split on any single whitespace character; punctuation stays attached to words
words = re.split(pattern, text)
print("Words:", words)

Words: ['Regular', 'Expressions', '(Regex)', 'is', 'a', 'powerful', 'tool', 'in', 'NLP', 'for', 'pattern', 'matching', 'and', 'text', 'processing.', '', 'It', 'allows', 'us', 'to', 'define', 'search', 'patterns', 'using', 'special', 'characters', 'and', 'rules', 'to', 'find,', 'extract,', 'or', 'replace', 'specific', 'text', 'within', 'large', 'datasets.', 'Regex', 'is', 'widely', 'used', 'for', 'text', 'cleaning,', 'data', 'validation,', 'entity', 'extraction,', 'and', 'preprocessing', 'tasks.', 'For', 'example,', 'it', 'can', 'be', 'used', 'to', 'extract', 'email', 'addresses,', 'phone', 'numbers,', 'dates,', 'and', 'specific', 'keywords', 'from', 'raw', 'text.', '', 'Key', 'components', 'of', 'regex', 'include', 'character', 'classes', '(e.g.,', '[0-9]', 'for', 'digits),', 'quantifiers', '(*,', '+,', '?', 'for', 'repetitions),', 'and', 'metacharacters', '(\\d', 'for', 'digits,', '\\w', 'for', 'words,', '\\s', 'for', 'spaces).', 'While', 'regex', 'is', 'incredibly', 'efficient', 'for
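
Splitting on whitespace leaves punctuation attached to tokens (`'find,'`, `'datasets.'`). An alternative sketch is to match the words themselves with `re.findall`. The `sample` string is a shortened stand-in (an assumption), and the hyphen-aware pattern is one possible choice among many.

```python
import re

# Shortened stand-in for the notebook's sample text (an assumption).
sample = "Regex is widely used for text cleaning, data validation, and rule-based tasks."

# \w+ keeps only runs of word characters, dropping punctuation.
words = re.findall(r"\w+", sample)

# Optionally keep hyphenated words together with a non-capturing group.
words_hyphen = re.findall(r"\w+(?:-\w+)*", sample)

print("Words:", words)
print("Words (hyphen-aware):", words_hyphen)
```

The first pattern splits `rule-based` into `rule` and `based`; the second keeps it as one token.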

### Tokenizing into Sentences

In [16]:
pattern = r"\.\s"  # Split on a period followed by one whitespace character (both are removed)
sentences = re.split(pattern, text)
print("Sentences:", sentences)

Sentences: ['Regular Expressions (Regex) is a powerful tool in NLP for pattern matching and text processing', '\nIt allows us to define search patterns using special characters and rules to find, extract, or replace specific text within large datasets', 'Regex is widely used for text cleaning, data validation, entity extraction, and preprocessing tasks', 'For example, it can be used to extract email addresses, phone numbers, dates, and specific keywords from raw text', '\nKey components of regex include character classes (e.g., [0-9] for digits), quantifiers (*, +, ? for repetitions), and metacharacters (\\d for digits, \\w for words, \\s for spaces)', 'While regex is incredibly efficient for rule-based text manipulation, complex patterns can be tricky to manage, making it important to balance flexibility and accuracy when applying it in NLP tasks.']
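
In the output above, the split consumes each period and leaves stray `\n` characters at the start of some sentences. A possible refinement, sketched on a shortened stand-in string (an assumption), uses a lookbehind so the punctuation is kept and splits on runs of whitespace.

```python
import re

# Shortened stand-in with a space-plus-newline boundary, as in the sample text (an assumption).
sample = "Regex is a powerful tool. \nIt allows us to define search patterns. Regex is widely used."

# (?<=[.!?]) asserts that the split point follows sentence-ending punctuation
# without consuming it; \s+ then removes all whitespace between sentences.
sentences = re.split(r"(?<=[.!?])\s+", sample)

print("Sentences:", sentences)
```

This remains a rule-based heuristic: abbreviations followed by a space (for example `e.g. `) would still be split incorrectly.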
