Regular expressions (regex) are powerful tools for text preprocessing in tasks like data cleaning, tokenization, and standardization. Here are some commonly used regex patterns and how they are applied for different preprocessing steps:

**1. Removing Punctuation**
Strip punctuation by removing every character that is neither a word character nor whitespace:

In [1]:
import re

text = "Hello, World! This is a test."
cleaned_text = re.sub(r'[^\w\s]', '', text)
cleaned_text
# Output: 'Hello World This is a test'


'Hello World This is a test'
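One detail of this pattern: `\w` includes the underscore, so `[^\w\s]` leaves `_` in place. A minimal sketch of a variant that strips underscores too (the sample string is made up for illustration):

```python
import re

text = "snake_case, kebab-case: done!"

# "_" counts as a word character, so [^\w\s] does not remove it.
keeps_underscore = re.sub(r'[^\w\s]', '', text)

# Adding "_" as an alternative removes underscores as well.
drops_underscore = re.sub(r'[^\w\s]|_', '', text)

print(keeps_underscore)  # snake_case kebabcase done
print(drops_underscore)  # snakecase kebabcase done
```

Whether underscores should count as punctuation depends on the data; identifiers in code snippets, for example, usually need them.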

**2. Lowercasing All Words**
Uniform casing reduces spurious variation between tokens (this step needs no regex):

In [2]:
text = "HELLO World"
cleaned_text = text.lower()
cleaned_text
# Output: 'hello world'


'hello world'

**3. Removing Numbers**
Remove digits when numeric values carry no meaning for the analysis:

In [3]:
text = "There are 2 apples and 3 oranges."
cleaned_text = re.sub(r'\d+', '', text)
cleaned_text
# Output: 'There are  apples and  oranges.'


'There are  apples and  oranges.'
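The output above still contains doubled spaces where the numbers used to be, and `\d+` on its own would leave a stray `.` behind for a decimal such as `3.5`. A small sketch that handles both, assuming numbers are plain integers or simple decimals (the sample sentence is made up):

```python
import re

text = "There are 2 apples and 3.5 kg of oranges."

# Match integers and simple decimals such as "3.5" (an assumed number format).
number = re.compile(r'\d+(?:\.\d+)?')

without_numbers = number.sub('', text)

# Collapse the gaps left behind by the removed numbers.
tidy = re.sub(r'\s+', ' ', without_numbers).strip()

print(tidy)  # There are apples and kg of oranges.
```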

**4. Removing Extra Whitespace**
To reduce runs of whitespace to a single space and trim both ends:

In [4]:
text = "This   is    a    test."
cleaned_text = re.sub(r'\s+', ' ', text).strip()
# Output: 'This is a test.'
cleaned_text

'This is a test.'
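Note that `\s` also matches tabs and newlines, so the pattern above flattens multi-line text into one line. If line breaks matter, one option is to collapse only spaces and tabs. A sketch with a made-up sample:

```python
import re

text = "First   line\t here.\n\nSecond    line."

# \s+ matches newlines too, so everything ends up on one line.
one_line = re.sub(r'\s+', ' ', text).strip()

# [ \t]+ only touches spaces and tabs, leaving line breaks intact.
keep_lines = re.sub(r'[ \t]+', ' ', text).strip()

print(repr(one_line))
print(repr(keep_lines))
```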

**5. Removing HTML Tags**
Useful when dealing with web-scraped data:

In [5]:
text = "<p>Hello, <b>World</b>!</p>"
cleaned_text = re.sub(r'<.*?>', '', text)
# Output: 'Hello, World!'
cleaned_text

'Hello, World!'
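Stripping tags does not decode HTML entities such as `&amp;`. A minimal sketch that pairs the regex with the standard library's `html.unescape` (the sample markup is made up):

```python
import html
import re

text = "<p>Fish &amp; chips <b>&lt;today&gt;</b></p>"

# Strip tags first, then decode entities. In the other order,
# "&lt;today&gt;" would become "<today>" and be deleted as if it were a tag.
no_tags = re.sub(r'<.*?>', '', text)
decoded = html.unescape(no_tags)

print(decoded)  # Fish & chips <today>
```

For messy real-world markup (comments, scripts, `>` inside attribute values), a proper HTML parser is generally more reliable than a regex.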

**6. Extracting Email Addresses**
To find and extract email addresses:

In [6]:
text = "Contact us at support@example.com or admin@domain.com"
emails = re.findall(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', text)
# Output: ['support@example.com', 'admin@domain.com']
emails

['support@example.com', 'admin@domain.com']
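The same pattern can also mask addresses instead of extracting them, which is a common anonymization step. A sketch, where the `<EMAIL>` placeholder token is an arbitrary choice:

```python
import re

# Compile once so the pattern can be reused across many documents.
EMAIL = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')

text = "Contact us at support@example.com or admin@domain.com"

# Replace each address with a placeholder token.
masked = EMAIL.sub('<EMAIL>', text)

print(masked)  # Contact us at <EMAIL> or <EMAIL>
```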

**7. Extracting URLs**
Identifies and extracts URLs from text:

In [7]:
text = "Visit https://example.com for more details."
urls = re.findall(r'https?://\S+', text)
# Output: ['https://example.com']
urls

['https://example.com']
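Because `\S+` runs until the next whitespace, a URL at the end of a clause picks up the trailing comma or period. One simple workaround is to trim that punctuation afterwards. A sketch with made-up URLs (the set of characters to strip is an assumption):

```python
import re

text = "See https://example.com/docs, or http://test.org."

raw = re.findall(r'https?://\S+', text)

# Strip sentence punctuation that \S+ swept up at the end of each match.
urls = [u.rstrip('.,;:!?)') for u in raw]

print(raw)   # ['https://example.com/docs,', 'http://test.org.']
print(urls)  # ['https://example.com/docs', 'http://test.org']
```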

**8. Removing Stop Words (using regex with a predefined stop word list)**
Remove stop words by matching them as whole words:

In [8]:
stop_words = r'\b(?:is|a|at|the|on|in|and|or|to|of)\b'
text = "This is a simple test."
cleaned_text = re.sub(stop_words, '', text)
# Output: 'This   simple test.' (the removed words leave their surrounding spaces behind)
cleaned_text

'This   simple test.'
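In practice the stop word list usually lives in a Python list rather than a hand-written pattern. A minimal sketch that builds the same kind of pattern from a list, ignores case, and tidies the leftover spaces (the sample sentence is made up):

```python
import re

stop_list = ["is", "a", "at", "the", "on", "in", "and", "or", "to", "of"]

# Escape each word and join with "|" inside a non-capturing group;
# \b on both sides keeps "at" from matching inside "cat" or "mat".
pattern = re.compile(
    r'\b(?:' + '|'.join(map(re.escape, stop_list)) + r')\b',
    re.IGNORECASE,
)

text = "The cat is on a mat."
without_stops = pattern.sub('', text)

# Collapse the gaps left where the stop words were.
tidy = re.sub(r'\s+', ' ', without_stops).strip()

print(tidy)  # cat mat.
```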

| Preprocessing Task              | Pattern / Method                                        | Description                                                                      |
|---------------------------------|---------------------------------------------------------|----------------------------------------------------------------------------------|
| Remove Punctuation              | `r'[^\w\s]'`                                            | Removes every character that is not a word character or whitespace (underscores are kept). |
| Lowercase All Words             | `text.lower()`                                          | Converts all characters to lowercase (a string method, not a regex).             |
| Remove Numbers                  | `r'\d+'`                                                | Removes runs of digits from the text.                                            |
| Remove Extra Whitespace         | `r'\s+'`                                                | Replaces each run of whitespace with a single space.                             |
| Remove HTML Tags                | `r'<.*?>'`                                              | Removes HTML tags (everything from `<` to the next `>`).                         |
| Extract Email Addresses         | `r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'`     | Matches common email address formats.                                            |
| Extract URLs                    | `r'https?://\S+'`                                       | Matches URLs that start with `http://` or `https://`.                            |
| Remove Stop Words               | `r'\b(?:is\|a\|at\|the\|on\|in\|and\|or\|to\|of)\b'`    | Matches the listed stop words as whole words so they can be removed.             |
| Remove Lowercase Letters        | `r'[a-z]'`                                              | Removes all lowercase letters from the text.                                     |
| Remove Uppercase Letters        | `r'[A-Z]'`                                              | Removes all uppercase letters from the text.                                     |
| Remove Alphabetic Characters    | `r'[A-Za-z]'`                                           | Removes all alphabetic characters, both uppercase and lowercase.                 |
| Keep Only Alphabetic Characters | `r'[^A-Za-z]'`                                          | Removes all non-alphabetic characters (including spaces), keeping only letters.  |
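Several of the patterns in the table are often chained into one cleaning function. The sketch below is one possible ordering, not a canonical one; the function name `preprocess` and the sample string are made up for illustration:

```python
import re

def preprocess(text):
    """Chain several of the steps above (the order is an assumption)."""
    text = re.sub(r'<.*?>', ' ', text)         # drop HTML tags
    text = re.sub(r'https?://\S+', ' ', text)  # drop URLs before punctuation removal
    text = text.lower()                        # uniform casing
    text = re.sub(r'\d+', ' ', text)           # drop digits
    text = re.sub(r'[^\w\s]', ' ', text)       # drop punctuation
    return re.sub(r'\s+', ' ', text).strip()   # tidy whitespace last

sample = "<p>Visit https://example.com NOW, only 3 days left!</p>"
result = preprocess(sample)
print(result)  # visit now only days left
```

Order matters here: URLs are removed before punctuation so that `://` and dots are still intact when the URL pattern runs, and whitespace is collapsed last because every earlier step can leave gaps.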


1. **Character Classes**
   - **[abc]**: Matches any single character that is either a, b, or c.
     - **Example**: `re.search(r'[abc]', 'cat')` matches `a` in "cat".
   - **[^abc]**: Matches any character that is not a, b, or c.
     - **Example**: `re.search(r'[^abc]', 'def')` matches `d` in "def".
   - **[a-z]**: Matches any lowercase letter from a to z.
     - **Example**: `re.search(r'[a-z]', 'Hello')` matches `e` in "Hello".
   - **[A-Z]**: Matches any uppercase letter from A to Z.
     - **Example**: `re.search(r'[A-Z]', 'Hello')` matches `H` in "Hello".
   - **[0-9]**: Matches any digit from 0 to 9.
     - **Example**: `re.search(r'[0-9]', 'Room 101')` matches `1` in "Room 101".

2. **Quantifiers**
   - **\***: Matches zero or more occurrences of the preceding element.
     - **Example**: `re.search(r'lo*se', 'loose')` matches `loose`.
   - **\+**: Matches one or more occurrences of the preceding element.
     - **Example**: `re.search(r'lo+se', 'loose')` matches `loose`.
   - **\?**: Matches zero or one occurrence of the preceding element.
     - **Example**: `re.search(r'lo?se', 'lse')` matches `lse`.
   - **{n}**: Matches exactly n occurrences of the preceding element.
     - **Example**: `re.search(r'o{2}', 'foo')` matches `oo` in "foo".
   - **{n,}**: Matches n or more occurrences of the preceding element.
     - **Example**: `re.search(r'o{2,}', 'fooo')` matches `ooo` in "fooo" (quantifiers are greedy).
   - **{n,m}**: Matches between n and m occurrences of the preceding element.
     - **Example**: `re.search(r'o{1,2}', 'fooo')` matches `oo` in "fooo" (the first two of the three).

3. **Anchors**
   - **^**: Matches the start of the string.
     - **Example**: `re.search(r'^Hello', 'Hello World')` matches `Hello`.
   - **$**: Matches the end of the string.
     - **Example**: `re.search(r'World$', 'Hello World')` matches `World`.

4. **Special Sequences**
   - **\d**: Matches any digit (equivalent to [0-9] with `re.ASCII`; by default it also matches other Unicode decimal digits).
     - **Example**: `re.search(r'\d', 'Room 101')` matches `1`.
   - **\D**: Matches any non-digit character.
     - **Example**: `re.search(r'\D', 'Room 101')` matches `R`.
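   - **\w**: Matches any word character: letters, digits, and underscore (equivalent to [a-zA-Z0-9_] with `re.ASCII`; by default it also matches Unicode letters and digits).
     - **Example**: `re.search(r'\w', 'Hello!')` matches `H`.
   - **\W**: Matches any non-word character (anything `\w` does not match).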
     - **Example**: `re.search(r'\W', 'Hello!')` matches `!`.
   - **\s**: Matches any whitespace character (spaces, tabs, newlines).
     - **Example**: `re.search(r'\s', 'Hello World')` matches the space between "Hello" and "World".
   - **\S**: Matches any non-whitespace character.
     - **Example**: `re.search(r'\S', ' Hello ')` matches `H`.

5. **Grouping and Capturing**
   - **(abc)**: Matches the exact sequence "abc" and captures it as group 1.
     - **Example**: `re.search(r'(abc)', 'abc def')` matches `abc`.
   - **a(bc)**: Matches "a" followed by "bc", capturing "bc".
     - **Example**: `re.search(r'a(bc)', 'abc')` captures `bc`.
   - **(?:abc)**: Matches "abc" but does not capture the group.
     - **Example**: `re.search(r'(?:abc)', 'abc def')` matches `abc` without capturing.

6. **Alternation**
   - **a|b**: Matches either "a" or "b".
     - **Example**: `re.search(r'a|b', 'cat')` matches `a`.
   - **cat|dog**: Matches either "cat" or "dog".
     - **Example**: `re.search(r'cat|dog', 'I have a dog')` matches `dog`.

7. **Lookahead and Lookbehind**
   - **(?=abc)**: Positive lookahead for "abc".
     - **Example**: `re.search(r'(?=abc)', 'abc def')` matches the position before `abc`.
   - **(?!abc)**: Negative lookahead; asserts "abc" does not follow.
     - **Example**: `re.search(r'(?!abc)', 'def')` matches the empty string at position 0 of "def", since "abc" does not follow there.
   - **(?<=abc)**: Positive lookbehind for "abc".
     - **Example**: `re.search(r'(?<=abc)', 'abcdef')` matches the position after `abc`.
   - **(?<!abc)**: Negative lookbehind; asserts "abc" does not precede.
     - **Example**: `re.search(r'(?<!abc)', 'def')` matches the empty string at position 0 of "def", since "abc" does not precede it.

8. **Flags**
   - **re.IGNORECASE**: Makes the pattern case-insensitive.
     - **Example**: `re.search(r'hello', 'Hello', re.IGNORECASE)` matches `Hello`.
   - **re.MULTILINE**: Allows `^` and `$` to match start and end of lines.
     - **Example**: `re.findall(r'^\w+', 'Line 1\nLine 2', re.MULTILINE)` matches `Line` in both lines.
   - **re.DOTALL**: Allows `.` to match newline characters.
     - **Example**: `re.search(r'.+', 'First line\nSecond line', re.DOTALL)` matches the entire text.
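The quantifiers listed above are greedy by default: they consume as much as they can. Appending `?` makes them lazy, which is why the HTML pattern earlier is written `<.*?>` rather than `<.*>`. A small sketch of the difference (the sample string is made up):

```python
import re

html_text = "<b>bold</b> and <i>italic</i>"

# Greedy: .* runs from the first "<" to the very last ">".
greedy = re.findall(r'<.*>', html_text)

# Lazy: .*? stops at the first ">" it can, so each tag matches separately.
lazy = re.findall(r'<.*?>', html_text)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```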


In [10]:
import re

# Sample text
text = 'Hello World This is a test'

# Character Classes
print("Character Classes:")
print("Matches [abc]:", re.search(r'[abc]', text).group())  # Matches any a, b, or c
print("Matches [^abc]:", re.search(r'[^abc]', text).group())  # Matches any character that is not a, b, or c
print("Matches [a-z]:", re.search(r'[a-z]', text).group())  # Matches any lowercase letter
print("Matches [A-Z]:", re.search(r'[A-Z]', text).group())  # Matches any uppercase letter
print("Matches [0-9]:", re.search(r'[0-9]', text + ' 1').group())  # Matches any digit

# Quantifiers
print("\nQuantifiers:")
print("Matches lo*se:", re.search(r'lo*se', 'loose').group())  # Matches zero or more occurrences of 'o'
print("Matches lo+se:", re.search(r'lo+se', 'loose').group())  # Matches one or more occurrences of 'o'
print("Matches lo?se:", re.search(r'lo?se', 'lse').group())  # Matches zero or one occurrence of 'o'
print("Matches o{2}:", re.search(r'o{2}', 'foo').group())  # Matches exactly 2 occurrences of 'o'
print("Matches o{2,}:", re.search(r'o{2,}', 'fooo').group())  # Matches 2 or more occurrences of 'o'
print("Matches o{1,2}:", re.search(r'o{1,2}', 'fooo').group())  # Matches 1 or 2 occurrences of 'o'

# Anchors
print("\nAnchors:")
print("Matches ^Hello:", re.search(r'^Hello', text).group())  # Matches if starts with "Hello"

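# `text` ends with "test", not "World", so re.search returns None here.
# Check for a match before calling .group() to avoid an AttributeError.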
match = re.search(r'World$', text)
if match:
    print("Matches World$:", match.group())  # Matches if ends with "World"
else:
    print("No match found for World$.")

# Special Sequences
print("\nSpecial Sequences:")
print("Matches \\d:", re.search(r'\d', 'Room 101').group())  # Matches any digit
print("Matches \\D:", re.search(r'\D', 'Room 101').group())  # Matches any non-digit
print("Matches \\w:", re.search(r'\w', 'Hello!').group())  # Matches any alphanumeric character
print("Matches \\W:", re.search(r'\W', 'Hello!').group())  # Matches any non-alphanumeric character
print("Matches \\s:", re.search(r'\s', 'Hello World').group())  # Matches any whitespace
print("Matches \\S:", re.search(r'\S', ' Hello ').group())  # Matches any non-whitespace

# Grouping and Capturing
print("\nGrouping and Capturing:")
print("Matches (abc):", re.search(r'(abc)', 'abc def').group())  # Matches "abc"
print("Matches a(bc):", re.search(r'a(bc)', 'abc').group(1))  # Captures "bc"
print("Matches (?:abc):", re.search(r'(?:abc)', 'abc def').group())  # Matches "abc" without capturing

# Alternation
print("\nAlternation:")
print("Matches a|b:", re.search(r'a|b', 'cat').group())  # Matches "a" or "b"
print("Matches cat|dog:", re.search(r'cat|dog', 'I have a dog').group())  # Matches "dog"

# Lookahead and Lookbehind
print("\nLookahead and Lookbehind:")
print("Matches (?=abc):", re.search(r'(?=abc)', 'abc def'))  # Positive lookahead (position before "abc")
print("Matches (?!abc):", re.search(r'(?!abc)', 'def'))  # Negative lookahead (position in "def")
print("Matches (?<=abc):", re.search(r'(?<=abc)', 'abcdef'))  # Positive lookbehind (position after "abc")
print("Matches (?<!abc):", re.search(r'(?<!abc)', 'def'))  # Negative lookbehind (position in "def")

# Flags
print("\nFlags:")
print("Matches re.IGNORECASE:", re.search(r'hello', 'Hello', re.IGNORECASE).group())  # Case-insensitive
print("Matches re.MULTILINE:", re.findall(r'^\w+', 'Line 1\nLine 2', re.MULTILINE))  # Matches start of lines
print("Matches re.DOTALL:", re.search(r'.+', 'First line\nSecond line', re.DOTALL).group())  # Matches entire text


Character Classes:
Matches [abc]: a
Matches [^abc]: H
Matches [a-z]: e
Matches [A-Z]: H
Matches [0-9]: 1

Quantifiers:
Matches lo*se: loose
Matches lo+se: loose
Matches lo?se: lse
Matches o{2}: oo
Matches o{2,}: ooo
Matches o{1,2}: oo

Anchors:
Matches ^Hello: Hello
No match found for World$.

Special Sequences:
Matches \d: 1
Matches \D: R
Matches \w: H
Matches \W: !
Matches \s:  
Matches \S: H

Grouping and Capturing:
Matches (abc): abc
Matches a(bc): bc
Matches (?:abc): abc

Alternation:
Matches a|b: a
Matches cat|dog: dog

Lookahead and Lookbehind:
Matches (?=abc): <re.Match object; span=(0, 0), match=''>
Matches (?!abc): <re.Match object; span=(0, 0), match=''>
Matches (?<=abc): <re.Match object; span=(3, 3), match=''>
Matches (?<!abc): <re.Match object; span=(0, 0), match=''>

Flags:
Matches re.IGNORECASE: Hello
Matches re.MULTILINE: ['Line', 'Line']
Matches re.DOTALL: First line
Second line
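The lookaround demos above only report zero-width positions. As a rough illustration of how lookarounds are typically combined with a real pattern, the sketch below pulls numbers out of a made-up sentence depending on what surrounds them:

```python
import re

text = "Ship 10kg for $45 within 3 days"

# Lookahead: digits that are followed by "kg" (the unit is not part of the match).
weights = re.findall(r'\d+(?=kg)', text)

# Lookbehind: digits that are preceded by "$" (the sign is not part of the match).
prices = re.findall(r'(?<=\$)\d+', text)

# Negative lookbehind: numbers that do not start right after "$".
not_prices = re.findall(r'(?<!\$)\b\d+', text)

print(weights)     # ['10']
print(prices)      # ['45']
print(not_prices)  # ['10', '3']
```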


