# **Tokenization**
Tokenization has two common meanings. In data security, it replaces sensitive data (such as credit card numbers) with unique, non-sensitive placeholders called tokens, which keeps PII out of downstream systems. In NLP, it breaks text into smaller units (sentences, words, or subwords) so that language models can process it efficiently. This notebook covers the NLP sense.

## Terminology
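- **Corpus**: The whole body of text (here, a paragraph)
- **Document**: A single unit of the corpus (here, a sentence)
- **Vocabulary**: The set of *unique* words in the corpus
- **Words**: All the words in the corpus, counting repeats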

## Implementation

In [2]:
corpus = """Pizza originated in Naples in the 18th century — as a simple street food for the poor — topped with tomatoes, cheese, and herbs! 
In 1889, baker Raffaele Esposito created "Pizza Margherita," honoring Queen Margherita of Savoy... iconic, right?
Italian immigrants later popularized pizza in the United States; making it world's favourite dish.
"""
print(corpus)

Pizza originated in Naples in the 18th century — as a simple street food for the poor — topped with tomatoes, cheese, and herbs! 
In 1889, baker Raffaele Esposito created "Pizza Margherita," honoring Queen Margherita of Savoy... iconic, right?
Italian immigrants later popularized pizza in the United States; making it world's favourite dish.



In [3]:
from nltk.tokenize import sent_tokenize, word_tokenize, wordpunct_tokenize

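# Split the corpus into sentences (documents).
# Note: sent_tokenize needs NLTK's Punkt data, e.g. nltk.download('punkt')
# ('punkt_tab' on newer NLTK releases).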
documents = sent_tokenize(corpus)
documents

['Pizza originated in Naples in the 18th century — as a simple street food for the poor — topped with tomatoes, cheese, and herbs!',
 'In 1889, baker Raffaele Esposito created "Pizza Margherita," honoring Queen Margherita of Savoy... iconic, right?',
 "Italian immigrants later popularized pizza in the United States; making it world's favourite dish."]

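Note that `sent_tokenize` keeps `Savoy... iconic, right?` inside one sentence. For contrast, here is a minimal sketch of a naive regex splitter (the helper `naive_sent_split` is hypothetical, not part of NLTK), which breaks that sentence at the ellipsis:

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sentence = ('In 1889, baker Raffaele Esposito created "Pizza Margherita," '
            'honoring Queen Margherita of Savoy... iconic, right?')

pieces = naive_sent_split(sentence)
for piece in pieces:
    print(repr(piece))
# The ellipsis is treated as a sentence end, so this prints two pieces,
# the second being 'iconic, right?'.
```

This is one reason to prefer a trained sentence tokenizer over a hand-written rule for real text.
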
In [4]:
# Tokenize each sentence into words; punctuation marks become separate tokens
for doc in documents:
    print(word_tokenize(doc))

['Pizza', 'originated', 'in', 'Naples', 'in', 'the', '18th', 'century', '—', 'as', 'a', 'simple', 'street', 'food', 'for', 'the', 'poor', '—', 'topped', 'with', 'tomatoes', ',', 'cheese', ',', 'and', 'herbs', '!']
['In', '1889', ',', 'baker', 'Raffaele', 'Esposito', 'created', '``', 'Pizza', 'Margherita', ',', "''", 'honoring', 'Queen', 'Margherita', 'of', 'Savoy', '...', 'iconic', ',', 'right', '?']
['Italian', 'immigrants', 'later', 'popularized', 'pizza', 'in', 'the', 'United', 'States', ';', 'making', 'it', 'world', "'s", 'favourite', 'dish', '.']


In [5]:
# wordpunct_tokenize splits on every run of punctuation, so "world's" becomes 'world', "'", 's'

for doc in documents:
    print(wordpunct_tokenize(doc))

['Pizza', 'originated', 'in', 'Naples', 'in', 'the', '18th', 'century', '—', 'as', 'a', 'simple', 'street', 'food', 'for', 'the', 'poor', '—', 'topped', 'with', 'tomatoes', ',', 'cheese', ',', 'and', 'herbs', '!']
['In', '1889', ',', 'baker', 'Raffaele', 'Esposito', 'created', '"', 'Pizza', 'Margherita', ',"', 'honoring', 'Queen', 'Margherita', 'of', 'Savoy', '...', 'iconic', ',', 'right', '?']
['Italian', 'immigrants', 'later', 'popularized', 'pizza', 'in', 'the', 'United', 'States', ';', 'making', 'it', 'world', "'", 's', 'favourite', 'dish', '.']
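
The difference from `word_tokenize` comes from how `wordpunct_tokenize` works: it is a pure regular-expression tokenizer. The sketch below reproduces that behaviour with the standard library, assuming the pattern `\w+|[^\w\s]+` that NLTK's `WordPunctTokenizer` is built on. The function name `simple_wordpunct` is made up for this example.

```python
import re

# Assumed pattern: a run of word characters, OR a run of characters
# that are neither word characters nor whitespace (i.e. punctuation).
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")

def simple_wordpunct(text):
    """Return all word runs and punctuation runs, in order."""
    return WORDPUNCT_PATTERN.findall(text)

print(simple_wordpunct("world's favourite dish."))
# ['world', "'", 's', 'favourite', 'dish', '.']

print(simple_wordpunct('"Pizza Margherita," honoring'))
# ['"', 'Pizza', 'Margherita', ',"', 'honoring']
```

Because adjacent punctuation is matched as a single run, `,"` stays together as one token, and the apostrophe in `world's` is split off on its own. `word_tokenize` instead applies language-aware rules, which is why it yields `'s` and converts the quotes to `` and ''.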
