# Natural Language Processing: Words, Tokens, and Regular Expressions
## Exercises Notebook - Session 2

This notebook contains exercises covering:
- Word and token counting
- Regular expressions in Python
- Unicode handling
- Unix text processing tools
- Tokenization concepts

---
## Section 1: Word and Token Counting
---

### Exercise 1.1: Counting Words

Given the sentence from the slides: *"They picnicked by the pool, then lay back on the grass and looked at the stars."*

Write Python code to:
1. Count words excluding punctuation
2. Count words including punctuation as separate tokens
3. Count unique word types (case-insensitive)

In [None]:
# YOUR CODE HERE
sentence = "They picnicked by the pool, then lay back on the grass and looked at the stars."

# 1. Count words excluding punctuation

# 2. Count words including punctuation

# 3. Count unique types (case-insensitive)
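
One possible approach, offered as a sketch rather than the reference answer: it assumes a "word" is a run of `\w` characters and that every other non-space character is a punctuation token. The variable names are illustrative.

```python
import re

sentence = "They picnicked by the pool, then lay back on the grass and looked at the stars."

# 1. Words only: runs of word characters, punctuation dropped
words = re.findall(r"\w+", sentence)

# 2. Words plus each punctuation mark as its own token
tokens = re.findall(r"\w+|[^\w\s]", sentence)

# 3. Unique types after lowercasing
types = {w.lower() for w in words}

print(len(words), len(tokens), len(types))  # 16 18 14
```

The three occurrences of "the" collapse into one type, which is why the type count is lower than the word count.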

### Exercise 1.2: Handling Disfluencies

The slides show the utterance: *"I do uh main- mainly business data processing"*

Write code to:
1. Count all tokens including disfluencies
2. Remove filled pauses (uh, um) and fragments (words ending with -)
3. Count "clean" words

In [None]:
# YOUR CODE HERE
utterance = "I do uh main- mainly business data processing"
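
A minimal sketch, assuming whitespace tokenization is good enough for this utterance and that the filled-pause list is just `uh` and `um` (the set name `FILLED_PAUSES` is made up for illustration):

```python
utterance = "I do uh main- mainly business data processing"

FILLED_PAUSES = {"uh", "um"}  # assumed list; extend as needed

# 1. All tokens, disfluencies included
tokens = utterance.split()

# 2. Drop filled pauses and fragments (tokens ending in a hyphen)
clean = [
    t for t in tokens
    if t.lower() not in FILLED_PAUSES and not t.endswith("-")
]

# 3. Compare the two counts
print(len(tokens), len(clean))  # 8 6
print(clean)
```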

---
## Section 2: Regular Expressions
---

### Exercise 2.1: Basic Pattern Matching

Using the regex patterns from the slides, write code to:
1. Find all words starting with capital letters
2. Find all digits in a text
3. Find words that are NOT capitalized

In [None]:
# YOUR CODE HERE
text = "Chapter 1: Down the Rabbit Hole. Alice was 7 years old in 1865."
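
A sketch of one way to do this with `re.findall`. It assumes "not capitalized" means a word whose first letter is lowercase; other readings (for example, including numbers) would need a different pattern.

```python
import re

text = "Chapter 1: Down the Rabbit Hole. Alice was 7 years old in 1865."

# 1. Words starting with a capital letter
capitalized = re.findall(r"\b[A-Z]\w*", text)

# 2. Every individual digit
digits = re.findall(r"\d", text)

# 3. Words starting with a lowercase letter (assumed reading of "not capitalized")
lowercase_words = re.findall(r"\b[a-z]\w*", text)

print(capitalized)      # ['Chapter', 'Down', 'Rabbit', 'Hole', 'Alice']
print(digits)           # ['1', '7', '1', '8', '6', '5']
print(lowercase_words)  # ['the', 'was', 'years', 'old', 'in']
```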

### Exercise 2.2: Kleene Star and Plus

The slides explain Kleene operators (* and +).

Write patterns to match:
1. "baa" followed by zero or more 'a's (baa, baaa, baaaa...)
2. One or more digits
3. Any sequence of characters (wildcard)

In [None]:
# YOUR CODE HERE
test_strings = ["ba", "baa", "baaa", "baaaa", "baaaaa"]
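
The following sketch shows one pattern per item. `re.fullmatch` is used so that the whole string must match; the digit test sentence is borrowed from Exercise 2.1 purely as sample input.

```python
import re

test_strings = ["ba", "baa", "baaa", "baaaa", "baaaaa"]

# 1. "baa" followed by zero or more further a's (equivalent to r"baa+")
baa_pattern = re.compile(r"baaa*")
baa_matches = [s for s in test_strings if baa_pattern.fullmatch(s)]

# 2. One or more digits
digit_runs = re.findall(r"[0-9]+", "Alice was 7 years old in 1865.")

# 3. Any sequence of characters: '.' is the wildcard, '*' repeats it
wildcard = re.fullmatch(r".*", "any characters at all")

print(baa_matches)           # ['baa', 'baaa', 'baaaa', 'baaaaa']
print(digit_runs)            # ['7', '1865']
print(wildcard is not None)  # True
```

Note that `.` does not match a newline unless `re.DOTALL` is set, so `.*` matches any sequence within a single line.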

### Exercise 2.3: Anchors and Word Boundaries

Write patterns using anchors (^, $, \b) to:
1. Match lines starting with a capital letter
2. Match lines ending with a period
3. Find the word "the" as a complete word (not in "other" or "there")

In [None]:
# YOUR CODE HERE
lines = [
    "The quick brown fox",
    "jumped over the lazy dog.",
    "other animals watched.",
    "There was nothing else."
]
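
A possible solution sketch. It treats each list element as one line and uses `re.IGNORECASE` for the third pattern so that "The" also counts as the word "the"; drop the flag if only the lowercase form should match.

```python
import re

lines = [
    "The quick brown fox",
    "jumped over the lazy dog.",
    "other animals watched.",
    "There was nothing else.",
]

# 1. ^ anchors the match at the start of the line
starts_with_capital = [line for line in lines if re.search(r"^[A-Z]", line)]

# 2. $ anchors at the end; the period is escaped because '.' is a wildcard
ends_with_period = [line for line in lines if re.search(r"\.$", line)]

# 3. \b on both sides rejects "other" and "There"
contains_the = [
    line for line in lines
    if re.search(r"\bthe\b", line, flags=re.IGNORECASE)
]

print(starts_with_capital)
print(ends_with_period)
print(contains_the)
```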

### Exercise 2.4: Substitutions and Capture Groups

The slides show date format conversion. Write code to:
1. Convert US dates (mm/dd/yyyy) to EU format (dd-mm-yyyy)
2. Anonymize email addresses by replacing with [EMAIL]
3. Convert "I'm" contractions to "I am"

In [None]:
# YOUR CODE HERE
text_dates = "Meeting on 10/15/2011 and 03/22/2024"
text_emails = "Contact john@example.com or jane@test.org"
text_contractions = "I'm happy and I'm excited"

### Solution 2.4

In [None]:
import re

# 1. Date format conversion (exact example from slides)
text_dates = "The date is 10/15/2011"
converted = re.sub(r'(\d{2})/(\d{2})/(\d{4})', r'\2-\1-\3', text_dates)
print(f"1. US to EU date: '{text_dates}' -> '{converted}'")

# 2. Email anonymization
text_emails = "Contact john@example.com or jane@test.org"
anon = re.sub(r'\b[\w.-]+@[\w.-]+\.\w+\b', '[EMAIL]', text_emails)
print(f"\n2. Anonymized: '{anon}'")

# 3. Contraction expansion
text_contractions = "I'm happy and I'm excited"
expanded = re.sub(r"\bI'm\b", "I am", text_contractions)  # \b limits the match to the standalone word
print(f"\n3. Expanded: '{expanded}'")

# EXPLANATION (Slides reference):
# Capture groups use parentheses () to store matched values.
# In the replacement string:
# - \1 refers to first captured group (month)
# - \2 refers to second captured group (day)
# - \3 refers to third captured group (year)
# The slides show: re.sub(r"(\d{2})/(\d{2})/(\d{4})", r"\2-\1-\3", string)

### Exercise 2.5: ELIZA-style Pattern Matching

The slides show ELIZA, an early chatbot using regex.
Implement simple ELIZA rules for:
1. "I'm [emotion]" -> "WHY DO YOU THINK YOU ARE [emotion]"
2. "I need [something]" -> "WHAT WOULD IT MEAN IF YOU GOT [something]"
3. Any input with "always" -> "CAN YOU THINK OF A SPECIFIC EXAMPLE?"

In [None]:
# YOUR CODE HERE
def simple_eliza(user_input):
    # Implement ELIZA rules
    pass

# Test
test_inputs = [
    "I'm depressed",
    "I need a vacation", 
    "They always ignore me"
]
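
One way the three rules could be wired together, as a sketch: rules are tried in order and the first match wins. The `RULES` table, the stripping of trailing punctuation, and the fallback reply `"PLEASE GO ON"` are assumptions made for this example, not something the exercise specifies.

```python
import re

# (pattern, response template) pairs, tried in order
RULES = [
    (re.compile(r"\bI'm (.+)", re.IGNORECASE), "WHY DO YOU THINK YOU ARE {}"),
    (re.compile(r"\bI need (.+)", re.IGNORECASE), "WHAT WOULD IT MEAN IF YOU GOT {}"),
    (re.compile(r"\balways\b", re.IGNORECASE), "CAN YOU THINK OF A SPECIFIC EXAMPLE?"),
]

def simple_eliza(user_input):
    for pattern, template in RULES:
        match = pattern.search(user_input)
        if match:
            # Uppercase the captured text and drop trailing punctuation
            captured = [g.rstrip(".!?").upper() for g in match.groups()]
            return template.format(*captured)
    return "PLEASE GO ON"  # hypothetical fallback when no rule fires

for line in ["I'm depressed", "I need a vacation", "They always ignore me"]:
    print(f"{line!r} -> {simple_eliza(line)}")
```

Rule order matters: an input such as "I'm always tired" would be answered by the first rule, not the "always" rule.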

---
## Section 3: Unicode and Encoding
---

### Exercise 3.1: Unicode Code Points

The slides explain that Unicode assigns code points to characters.
Write code to:
1. Get the code point of 'a' (should be U+0061)
2. Get the character for code point U+00F1 (ñ)
3. Show the UTF-8 byte encoding of "hello"

In [None]:
# YOUR CODE HERE
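
A short sketch using only the built-ins `ord`, `chr`, and `str.encode`; the variable names are illustrative.

```python
# 1. Character -> code point
code_point = ord("a")
code_point_label = f"U+{code_point:04X}"
print(code_point_label)  # U+0061

# 2. Code point -> character
char = chr(0x00F1)
print(char)  # ñ

# 3. UTF-8 bytes of an ASCII-only string (one byte per character)
utf8_bytes = "hello".encode("utf-8")
print(list(utf8_bytes))  # [104, 101, 108, 108, 111]
```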

### Exercise 3.2: Multi-byte UTF-8 Characters

The slides show that different scripts need different byte lengths.
Examine the byte lengths for:
1. ASCII character: 'A'
2. Spanish: 'ñ' (U+00F1)
3. Chinese: '姚' (a character from the slides)
4. Emoji: '😀'

In [None]:
# YOUR CODE HERE
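
The sketch below encodes one character from each group and reports the byte count. The characters are written as escapes so the example does not depend on the file's encoding. U+59DA and U+1F600 are assumed to be the Chinese character and emoji the exercise intends.

```python
samples = {
    "ASCII": "A",
    "Spanish": "\u00f1",
    "Chinese (assumed U+59DA)": "\u59da",
    "Emoji (assumed U+1F600)": "\U0001F600",
}

byte_lengths = {}
for label, char in samples.items():
    encoded = char.encode("utf-8")
    byte_lengths[label] = len(encoded)
    hex_bytes = encoded.hex(" ")  # separator argument needs Python 3.8+
    print(f"{label}: U+{ord(char):04X} -> {len(encoded)} byte(s): {hex_bytes}")
```

The byte count grows from 1 to 4 as the code point value increases, which is the variable-length property of UTF-8.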

### Exercise 3.3: String Length vs Byte Length

The slides note that len() on a Python string counts code points, not bytes.
Compare string length vs byte length for mixed text.

In [None]:
# YOUR CODE HERE
mixed_text = "Señor 你好 😀"
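
A minimal comparison, with the mixed text spelled out as escapes (assumed to be "Señor", two Chinese characters, and one emoji, separated by spaces):

```python
mixed_text = "Se\u00f1or \u4f60\u597d \U0001F600"

num_code_points = len(mixed_text)              # counts code points
num_bytes = len(mixed_text.encode("utf-8"))    # counts UTF-8 bytes

print(num_code_points, num_bytes)  # 10 18
```

The gap comes from the non-ASCII characters: 2 bytes for ñ, 3 bytes for each Chinese character, and 4 bytes for the emoji.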

---
## Section 4: Shell Commands for Text Processing
---

### Exercise 4.1: Basic tr Command

The slides show Unix tools for tokenization.
Using shell commands, tokenize text by:
1. Converting all characters to lowercase
2. Replacing non-alphabetic characters with newlines

In [None]:
# Create a sample file first
sample_text = """THE SONNETS
by William Shakespeare
From fairest creatures
We desire increase"""

with open('sample.txt', 'w') as f:
    f.write(sample_text)

# YOUR SHELL COMMANDS HERE (use ! prefix in Jupyter)
# !cat sample.txt | tr ...

### Exercise 4.2: Word Frequency Count

Implement the complete pipeline from the slides:
```
tr -sc 'A-Za-z' '\n' < file | sort | uniq -c | sort -n -r
```
This should print each distinct word with its count, most frequent first.

In [None]:
# YOUR CODE HERE - implement in shell or Python
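
A Python sketch of the same pipeline, assuming the sample text from Exercise 4.1 as input. `re.findall` stands in for `tr -sc 'A-Za-z' '\n'`, and `collections.Counter` stands in for `sort | uniq -c | sort -n -r`; the order of words with equal counts may differ from the shell output.

```python
import re
from collections import Counter

sample_text = """THE SONNETS
by William Shakespeare
From fairest creatures
We desire increase"""

# Keep only runs of ASCII letters, like tr -sc 'A-Za-z' '\n'
words = re.findall(r"[A-Za-z]+", sample_text)

# Count each distinct word, like sort | uniq -c
freq = Counter(words)

# Most frequent first, like sort -n -r
for word, count in freq.most_common():
    print(f"{count:7d} {word}")
```

Because case is preserved, "THE" and "the" would be counted separately; lowercasing the text first merges them.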

### Exercise 4.3: The Mystery 'd'

The slides show that in Shakespeare's text, 'd' appears 8954 times.
The slide asks "What happened here?"

Explain why and write code to investigate contractions like "'d".

In [None]:
# YOUR CODE HERE
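
One plausible explanation is that splitting on every non-letter also splits at apostrophes, so a form like "lov'd" yields a stray token "d". The sketch below illustrates this on a made-up line (not taken from the Shakespeare corpus) and then lists the words that produced the stray tokens.

```python
import re
from collections import Counter

# Invented sample line for illustration only
text = "Belov'd of all, and lov'd by none; he'd not have fear'd it."

# Letter-only tokenization, as in tr -sc 'A-Za-z' '\n'
tokens = re.findall(r"[A-Za-z]+", text)
d_count = Counter(tokens)["d"]

# Words ending in 'd, which are the source of the stray "d" tokens
sources = re.findall(r"[A-Za-z]+'d\b", text)

print(d_count)  # 4
print(sources)  # ["Belov'd", "lov'd", "he'd", "fear'd"]
```

Running the second pattern over the full corpus would show which contractions account for the count reported in the slides.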