# Text Tokenization Exercise

This exercise explores the challenges of splitting text into sentences and words when dealing with complex real-world text containing dates, amounts, URLs, emails, acronyms, and multi-word expressions.

## The Challenge

Given a text variable, split it into:
1. **Sentences** - logical units of meaning ending with terminal punctuation
2. **Words (tokens)** - individual meaningful units

In [2]:
import re

# Sample text with challenging elements
text = """Dr. John Smith, Ph.D., earned $1,250.50 on Jan. 15, 2024, for his work at A.I. Corp. You can reach him at j.smith@ai-corp.co.uk or visit https://www.ai-corp.co.uk/team/dr-smith for more info. The U.S.A.-based company reported a 23.5% increase in Q3 revenue, totaling €2.5M."""

print("Original text:")
print(text)
# Naive approach: split wherever ., ! or ? is followed by whitespace
sentences = re.split(r'(?<=[.!?])\s+', text)

print("Sentences:")
for s in sentences:
    print("-", s)


Original text:
Dr. John Smith, Ph.D., earned $1,250.50 on Jan. 15, 2024, for his work at A.I. Corp. You can reach him at j.smith@ai-corp.co.uk or visit https://www.ai-corp.co.uk/team/dr-smith for more info. The U.S.A.-based company reported a 23.5% increase in Q3 revenue, totaling €2.5M.
Sentences:
- Dr.
- John Smith, Ph.D., earned $1,250.50 on Jan.
- 15, 2024, for his work at A.I.
- Corp.
- You can reach him at j.smith@ai-corp.co.uk or visit https://www.ai-corp.co.uk/team/dr-smith for more info.
- The U.S.A.-based company reported a 23.5% increase in Q3 revenue, totaling €2.5M.
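
The naive split above breaks the first sentence into four pieces, because every period followed by whitespace is treated as a sentence boundary, including the ones after `Dr.`, `Jan.`, and `A.I.`. One possible refinement is to keep the same split and then re-join any piece that ends in a known abbreviation. The sketch below is only illustrative: the `ABBREVIATIONS` set and the `split_sentences` helper are hypothetical names, and the list is hand-picked for this sample text.

```python
import re

text = """Dr. John Smith, Ph.D., earned $1,250.50 on Jan. 15, 2024, for his work at A.I. Corp. You can reach him at j.smith@ai-corp.co.uk or visit https://www.ai-corp.co.uk/team/dr-smith for more info. The U.S.A.-based company reported a 23.5% increase in Q3 revenue, totaling €2.5M."""

# Hypothetical, hand-picked list: it covers only the abbreviations in this sample.
ABBREVIATIONS = {"dr.", "jan.", "a.i."}

def split_sentences(text):
    """Split after . ! ? plus whitespace, then merge pieces ending in an abbreviation."""
    pieces = re.split(r'(?<=[.!?])\s+', text.strip())
    sentences = []
    buffer = ""
    for piece in pieces:
        # Pieces are re-joined with a single space, so original whitespace is not preserved.
        buffer = f"{buffer} {piece}" if buffer else piece
        last_word = buffer.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation, not a sentence end
        sentences.append(buffer)
        buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences

for s in split_sentences(text):
    print("-", s)
```

On the sample text this yields three sentences instead of six pieces. The approach still fails when an abbreviation genuinely ends a sentence (for example, a sentence that stops at `A.I.`), so it is a starting point rather than a general solution.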


In [3]:
# Naive approach: every run of word characters becomes a token
tokens = re.findall(r'\b\w+\b', text)

print("\nTokens:")
print(tokens)


Tokens:
['Dr', 'John', 'Smith', 'Ph', 'D', 'earned', '1', '250', '50', 'on', 'Jan', '15', '2024', 'for', 'his', 'work', 'at', 'A', 'I', 'Corp', 'You', 'can', 'reach', 'him', 'at', 'j', 'smith', 'ai', 'corp', 'co', 'uk', 'or', 'visit', 'https', 'www', 'ai', 'corp', 'co', 'uk', 'team', 'dr', 'smith', 'for', 'more', 'info', 'The', 'U', 'S', 'A', 'based', 'company', 'reported', 'a', '23', '5', 'increase', 'in', 'Q3', 'revenue', 'totaling', '2', '5M']
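
The token list above shows how `\b\w+\b` shreds the amount, the email address, the URL, and the dotted acronyms into fragments. One way to keep such items whole is an alternation that tries the most specific patterns first and falls back to plain words last. The following is a minimal sketch, not a complete tokenizer: `TOKEN_PATTERN` is a hypothetical name, and the abbreviation list (`Dr`, `Jan`, `Corp`), the currency symbols, and the `K`/`M`/`B` suffixes are assumptions tailored to this sample.

```python
import re

text = """Dr. John Smith, Ph.D., earned $1,250.50 on Jan. 15, 2024, for his work at A.I. Corp. You can reach him at j.smith@ai-corp.co.uk or visit https://www.ai-corp.co.uk/team/dr-smith for more info. The U.S.A.-based company reported a 23.5% increase in Q3 revenue, totaling €2.5M."""

# Alternatives are tried left to right, so specific patterns come before generic ones.
TOKEN_PATTERN = re.compile(r"""
    https?://\S+                                    # URLs (would also absorb trailing punctuation)
  | [\w.+-]+@[\w-]+(?:\.[\w-]+)+                    # email addresses
  | [$€£]?\d+(?:,\d{3})*(?:\.\d+)?(?:%|[KMB]\b)?    # numbers, amounts, percentages
  | (?:[A-Za-z]{1,2}\.){2,}                         # dotted acronyms such as A.I. or Ph.D.
  | \b(?:Dr|Jan|Corp)\.                             # assumed abbreviation list for this sample
  | \w+(?:[-']\w+)*                                 # plain words, allowing inner hyphens and apostrophes
""", re.VERBOSE)

tokens = TOKEN_PATTERN.findall(text)
print(tokens)
```

On the sample text this keeps `$1,250.50`, `23.5%`, `€2.5M`, `Ph.D.`, `A.I.`, `U.S.A.`, the email address, and the URL as single tokens. Note that `U.S.A.-based` is still split into `U.S.A.` and `based`, and that `Corp.` keeps its period even though that period also ends the sentence.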


# Corpus Tokenization Exercise

This exercise explores the challenges of splitting a large corpus into words and finding the most common ones.

In [None]:
import re
from collections import Counter

text = """Dr. John Smith, Ph.D., earned $1,250.50 on Jan. 15, 2024, for his work at A.I. Corp.
You can reach him at j.smith@ai-corp.co.uk or visit https://www.ai-corp.co.uk/team/dr-smith for more info.
The U.S.A.-based company reported a 23.5% increase in Q3 revenue, totaling €2.5M."""

tokens = re.findall(r'\b\w+\b', text.lower())
word_counts = Counter(tokens)
print("Most common words:")
for word, count in word_counts.most_common(10):
    print(word, count)

Most common words:
smith 3
a 3
corp 3
dr 2
for 2
at 2
ai 2
co 2
uk 2
john 1


## The Challenge

Given the file `shakes.txt` in the book folder, find the most common words in Shakespeare's text.

In [None]:
import re
from collections import Counter

# Read the file
with open("shakes.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Tokenize (split into words)
tokens = re.findall(r'\b\w+\b', text.lower())

# Count the words
word_counts = Counter(tokens)

# Show the most common ones
print("Most common words in Shakespeare:")
for word, count in word_counts.most_common(20):
    print(word, count)

Most common words in Shakespeare:
the 27843
and 26847
i 22538
to 19883
of 18307
a 14800
you 13928
my 12489
that 11563
in 11183
is 9808
d 8961
not 8760
for 8358
with 8066
me 7777
it 7750
s 7734
be 7147
this 6900
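
The single-letter entries `d` and `s` in the list above are most likely fragments of words with apostrophes (forms ending in `'d` or `'s`), which `\b\w+\b` cuts at the apostrophe. A possible fix is a pattern that allows an apostrophe inside a word. The sketch below demonstrates the difference on an invented sample line, not on `shakes.txt`; it also assumes the apostrophe may be either straight (`'`) or typographic (`’`), which would need to be checked against the actual file.

```python
import re
from collections import Counter

# Invented sample line for illustration; it is not taken from shakes.txt.
sample = "He lov'd the king's crown, and the king lov'd him."

# Naive pattern: the apostrophe splits each word into two tokens.
naive = re.findall(r'\b\w+\b', sample.lower())

# Apostrophe-aware pattern: letters, optionally joined by an inner apostrophe.
# Limitation: [a-z] ignores digits and non-ASCII letters.
aware = re.findall(r"[a-z]+(?:['’][a-z]+)*", sample.lower())

naive_counts = Counter(naive)
aware_counts = Counter(aware)

print("naive:", naive_counts.most_common(5))
print("aware:", aware_counts.most_common(5))
```

With the apostrophe-aware pattern, `lov'd` and `king's` are counted as whole words and the stray `d` and `s` tokens disappear from the counts.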
