# Applying RegEx to NLP - Preprocessing Techniques

## a. Extract Words Starting with an Uppercase Letter

Extract all words that start with an uppercase letter from an excerpt of the Alice in Wonderland text.

In [1]:
import re

# Text from Alice in Wonderland
text_alice = """Alice was beginning to get very tired of sitting by her sister on the bank,
and of having nothing to do. Once or twice she had peeped into the book
her sister was reading, but it had no pictures or conversations in it, "and
what is the use of a book," thought Alice, "without pictures or
conversations?"""

# RegEx pattern to match words starting with uppercase letter
# \b - word boundary (start of word)
# [A-Z] - matches exactly one uppercase letter
# [a-z]* - matches zero or more lowercase letters after the uppercase
pattern_uppercase = r'\b[A-Z][a-z]*'

# Find all words starting with uppercase
uppercase_words = re.findall(pattern_uppercase, text_alice)

print("=== Words Starting with Uppercase Letter ===")
print(f"RegEx pattern: {pattern_uppercase}")
print(f"\nExtracted words: {uppercase_words}")
print(f"\nTotal count: {len(uppercase_words)} words")

=== Words Starting with Uppercase Letter ===
RegEx pattern: \b[A-Z][a-z]*

Extracted words: ['Alice', 'Once', 'Alice']

Total count: 3 words
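
The pattern above accepts only lowercase ASCII letters after the initial capital, so it works for this excerpt but would cut a mixed-case or all-caps word short. As a minimal sketch, assuming such words should be kept whole, the second character class can be widened to `[A-Za-z]`. The sample sentence below is made up for illustration and is not from the Alice text.

```python
import re

# Hypothetical sentence with a mixed-case name and an acronym
sample = "Alice met McGregor at the NLP workshop in Paris."

# Original pattern: one uppercase letter, then lowercase letters only
strict = re.findall(r'\b[A-Z][a-z]*', sample)

# Widened variant: one uppercase letter, then letters of either case
relaxed = re.findall(r'\b[A-Z][A-Za-z]*', sample)

print(strict)   # ['Alice', 'Mc', 'N', 'Paris']
print(relaxed)  # ['Alice', 'McGregor', 'NLP', 'Paris']
```

Both variants are still ASCII-only; accented capitals would need a different character class.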


## b. Extract and Replace Whale Variations

Extract all instances of "Whale", "Whales", "whale", and "whales", then replace the first 10 instances with "leviathan".

In [7]:
# Read the Moby Dick text file
with open('melville-moby_dick.txt', 'r', encoding='utf-8') as file:
    moby_dick_text = file.read()

# RegEx pattern to match Whale, Whales, whale, whales
# \b - word boundary to match whole words only
# [Ww] - matches either W or w
# hales? - matches "hale" followed by optional "s"
pattern_whale = r'\b[Ww]hales?\b'

# Find all instances of whale variations
whale_instances = re.findall(pattern_whale, moby_dick_text)

print("=== Extract All Whale Variations ===")
print(f"RegEx pattern: {pattern_whale}")
print(f"\nTotal instances found: {len(whale_instances)}")
print(f"First 20 instances: {whale_instances[:20]}")

# Replace the first 10 instances with "leviathan"
# re.subn returns the new text and the number of substitutions made;
# count=10 stops after the first 10 matches, so no manual counter is needed
modified_text, replacement_count = re.subn(
    pattern_whale, "leviathan", moby_dick_text, count=10
)

print(f"\n=== Replacement Complete ===")
print(f"Number of replacements made: {replacement_count}")

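# Show each 'leviathan' with up to 100 characters of same-line context on either side
# ("." does not match newlines). This matches every 'leviathan' in the text,
# including any that were already present in the original, not only the replacements.
print("\n--- Showing sections with 'leviathan' replacements ---")
for i, match in enumerate(re.finditer(r'.{0,100}leviathan.{0,100}', modified_text), 1):
    print(f"\n{i}. ...{match.group().strip()}...")
    if i >= 10:  # Stop after the first 10 occurrences of leviathan
        break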

=== Extract All Whale Variations ===
RegEx pattern: \b[Ww]hales?\b

Total instances found: 1489
First 20 instances: ['Whale', 'Whale', 'Whale', 'Whales', 'Whales', 'Whales', 'Whale', 'Whale', 'Whale', 'Whale', 'Whale', 'Whale', 'Whale', 'Whale', 'Whale', 'Whale', 'whale', 'whales', 'whale', 'whales']

=== Replacement Complete ===
Number of replacements made: 10

--- Showing sections with 'leviathan' replacements ---

1. ...The Project Gutenberg eBook of Moby Dick; Or, The leviathan...

2. ...Title: Moby Dick; Or, The leviathan...

3. ...CHAPTER 42. The Whiteness of the leviathan....

4. ...CHAPTER 55. Of the Monstrous Pictures of leviathan....

5. ...CHAPTER 56. Of the Less Erroneous Pictures of leviathan, and the True...

6. ...CHAPTER 57. Of leviathan in Paint; in Teeth; in Wood; in Sheet-Iron; in...

7. ...CHAPTER 61. Stubb Kills a leviathan....

8. ...CHAPTER 65. The leviathan as a Dish....

9. ...CHAPTER 73. Stubb and Flask kill a Right leviathan; and Then Have a Talk...

10. ...C
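
The replacement above always writes a lowercase, singular "leviathan", so a capitalized or plural match loses its case and its plural ending. If that matters, one possible variation is to pass a function to the substitution and rebuild the replacement from the matched word. The sketch below is only an illustration: the helper name `to_leviathan` and the sample sentence are made up, and the exercise itself asks only for the plain "leviathan".

```python
import re

def to_leviathan(match):
    """Return 'leviathan' with the plural and capitalization of the matched word."""
    word = match.group(0)
    replacement = "leviathan" + ("s" if word.lower().endswith("s") else "")
    return replacement.capitalize() if word[0].isupper() else replacement

# Hypothetical sample text, not taken from Moby Dick
sample = "The Whale dived. Two whales surfaced; Whales everywhere, one whale left."

# count=3 limits the substitution to the first 3 matches;
# re.subn also reports how many substitutions were made
result, n_replaced = re.subn(r'\b[Ww]hales?\b', to_leviathan, sample, count=3)

print(result)
print(n_replaced)
```

On the sample, the first three matches become "Leviathan", "leviathans", and "Leviathans", and the fourth "whale" is left untouched.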

## c. Extract Jack Sparrow's Lines from Pirates Text

Install NLTK, download and import the webtext corpus, load pirates.txt, and extract all lines spoken by Jack Sparrow.

In [3]:
# Install NLTK if not already installed
# Uncomment the line below to install
# !pip install nltk

import re
import nltk

# Download the webtext corpus (only needs to be run once)
nltk.download('webtext', quiet=True)

# Import webtext from nltk.corpus
from nltk.corpus import webtext

# Load pirates.txt
pirates_text = webtext.raw('pirates.txt')

print("=== Pirates Text Loaded ===")
print(f"Total length: {len(pirates_text)} characters")
print(f"\n--- First 500 characters ---")
print(pirates_text[:500])

=== Pirates Text Loaded ===
Total length: 95368 characters

--- First 500 characters ---
PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted Elliott & Terry Rossio
[view looking straight down at rolling swells, sound of wind and thunder, then a low heartbeat]
Scene: PORT ROYAL
[teacups on a table in the rain]
[sheet music on music stands in the rain]
[bouquet of white orchids, Elizabeth sitting in the rain holding the bouquet]
[men rowing, men on horseback, to the sound of thunder]
[EITC logo on flag blowing in the wind]
[many rowboats are entering the harbor]
[Elizabeth sitting alon
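
The first 500 characters contain only the title and stage directions, so they do not show how dialogue is labeled. One way to check, sketched here under the assumption that dialogue lines follow a "SPEAKER: text" layout with the speaker in capitals, is to count the labels at the start of each line. The sample string and the GIBBS line are made up for illustration; on the real data the same call would be run on `pirates_text`.

```python
import re
from collections import Counter

# Made-up sample in the assumed "SPEAKER: dialogue" layout
sample_script = """Scene: PORT ROYAL
[teacups on a table in the rain]
JACK SPARROW: Sorry, mate.
GIBBS: Not quite according to plan.
JACK SPARROW: Complications arose, ensued, were overcome.
"""

# ^ with re.MULTILINE anchors at the start of every line.
# The label must start and end with an uppercase letter and may contain
# uppercase letters, spaces, periods, or apostrophes in between.
speaker_pattern = r"^([A-Z][A-Z .']*[A-Z])\s*:"

speaker_counts = Counter(re.findall(speaker_pattern, sample_script, re.MULTILINE))
print(speaker_counts.most_common())
```

Mixed-case prefixes such as "Scene:" are skipped because the label must be fully uppercase.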


In [8]:
# RegEx pattern to extract Jack Sparrow's lines
# Pattern explanation:
# JACK\s+SPARROW\s*: - matches "JACK SPARROW" followed by optional whitespace and a colon
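# \s* - matches optional whitespace after the colon (\s also matches newlines,
#       so a "JACK SPARROW:" label with nothing after it would capture the next line)
# (.+?) - captures the dialogue (non-greedy, at least one character; "." does not match a newline)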
# (?=\n|$) - lookahead: stops at newline or end of string

pattern_jack = r'JACK\s+SPARROW\s*:\s*(.+?)(?=\n|$)'

# Find all lines spoken by Jack Sparrow (re.IGNORECASE makes the speaker label case-insensitive)
jack_lines = re.findall(pattern_jack, pirates_text, re.IGNORECASE)

print("=== Jack Sparrow's Lines ===")
print(f"RegEx pattern: {pattern_jack}")
print(f"\nTotal lines spoken by Jack: {len(jack_lines)}")
print(f"\n--- ALL {len(jack_lines)} Lines by Captain Jack Sparrow ---\n")

for i, line in enumerate(jack_lines, 1):
    # Clean up the line (remove extra whitespace)
    cleaned_line = line.strip()
    print(f"{i}. {cleaned_line}")

=== Jack Sparrow's Lines ===
RegEx pattern: JACK\s+SPARROW\s*:\s*(.+?)(?=\n|$)

Total lines spoken by Jack: 193

--- ALL 193 Lines by Captain Jack Sparrow ---

1. Sorry, mate.
2. Mind if we make a little side trip? I didn't think so.
3. Complications arose, ensued, were overcome.
4. Mm-hmm!
5. Shiny?
6. Is that how you're all feeling, then? Perhaps dear old Jack is not serving your best interests as captain?
7. What did the bird say?
8. Ohhh!
9. It does me.
10. No! Much more better. It is a *drawing* of a key.
11. Gentlemen, what do keys do?
12. No! If we don't have the key, we can't open whatever it is we don't have that it unlocks. So what purpose would be served in finding whatever need be unlocked, which we don't have, without first having found the key what unlocks it?
13. You're not making any sense at all. Any more questions?
14. Hah! A heading. Set sail in a... mmm... a general... in *that* way - direction.
15. Come on, snap to and make sail, you know how this works. Come on, o