## **Tokenization**
In NLP, tokenization is the process of breaking text into smaller units called tokens: sentences, words, or subwords. It is usually the first preprocessing step, because later stages work on tokens rather than on raw strings. The same term is also used in data security, where sensitive values (like credit card numbers) are replaced with unique, non-sensitive placeholders; that meaning is not covered here.

### Terminologies
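- **Corpus**: The full body of text being processed (here, a whole paragraph)
- **Document**: One unit of the corpus (here, a sentence)
- **Vocabulary**: The set of *unique* words in the corpus
- **Words**: All the words in the corpus, counting repeats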
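
To make the four terms concrete, here is a minimal sketch using only the standard library. The two-sentence corpus and the naive regex-based splitting are assumptions made for illustration; they are not the NLTK tokenizers used in the implementation cells.

```python
import re

# Hypothetical two-sentence corpus, used only for this illustration.
corpus = "Pizza is popular. Pizza is cheap and pizza is tasty."

# Naive sentence split after ., ! or ? (a stand-in for a real sentence tokenizer).
documents = [s.strip() for s in re.split(r"(?<=[.!?])\s+", corpus) if s.strip()]

# All lowercased words, repeats included, punctuation dropped.
words = re.findall(r"[a-z]+", corpus.lower())

# The vocabulary keeps each distinct word once.
vocabulary = sorted(set(words))

print(len(documents))  # 2 documents
print(len(words))      # 10 words in total
print(vocabulary)      # ['and', 'cheap', 'is', 'pizza', 'popular', 'tasty']
```

The vocabulary (6 entries) is smaller than the word count (10) because "pizza" and "is" each appear three times.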

### Implementation

In [2]:
corpus = """Pizza originated in Naples in the 18th century — as a simple street food for the poor — topped with tomatoes, cheese, and herbs! 
In 1889, baker Raffaele Esposito created "Pizza Margherita," honoring Queen Margherita of Savoy... iconic, right?
Italian immigrants later popularized pizza in the United States; making it world's favourite dish.
"""
print(corpus)

Pizza originated in Naples in the 18th century — as a simple street food for the poor — topped with tomatoes, cheese, and herbs! 
In 1889, baker Raffaele Esposito created "Pizza Margherita," honoring Queen Margherita of Savoy... iconic, right?
Italian immigrants later popularized pizza in the United States; making it world's favourite dish.



In [3]:
from nltk.tokenize import sent_tokenize, word_tokenize, wordpunct_tokenize

# split the corpus into sentences (documents)
documents = sent_tokenize(corpus)
documents

['Pizza originated in Naples in the 18th century — as a simple street food for the poor — topped with tomatoes, cheese, and herbs!',
 'In 1889, baker Raffaele Esposito created "Pizza Margherita," honoring Queen Margherita of Savoy... iconic, right?',
 "Italian immigrants later popularized pizza in the United States; making it world's favourite dish."]

In [4]:
for doc in documents:
    print(word_tokenize(doc))

['Pizza', 'originated', 'in', 'Naples', 'in', 'the', '18th', 'century', '—', 'as', 'a', 'simple', 'street', 'food', 'for', 'the', 'poor', '—', 'topped', 'with', 'tomatoes', ',', 'cheese', ',', 'and', 'herbs', '!']
['In', '1889', ',', 'baker', 'Raffaele', 'Esposito', 'created', '``', 'Pizza', 'Margherita', ',', "''", 'honoring', 'Queen', 'Margherita', 'of', 'Savoy', '...', 'iconic', ',', 'right', '?']
['Italian', 'immigrants', 'later', 'popularized', 'pizza', 'in', 'the', 'United', 'States', ';', 'making', 'it', 'world', "'s", 'favourite', 'dish', '.']


In [5]:
# wordpunct_tokenize puts every run of punctuation in its own token,
# so "world's" becomes 'world', "'", 's'

for doc in documents:
    print(wordpunct_tokenize(doc))

['Pizza', 'originated', 'in', 'Naples', 'in', 'the', '18th', 'century', '—', 'as', 'a', 'simple', 'street', 'food', 'for', 'the', 'poor', '—', 'topped', 'with', 'tomatoes', ',', 'cheese', ',', 'and', 'herbs', '!']
['In', '1889', ',', 'baker', 'Raffaele', 'Esposito', 'created', '"', 'Pizza', 'Margherita', ',"', 'honoring', 'Queen', 'Margherita', 'of', 'Savoy', '...', 'iconic', ',', 'right', '?']
['Italian', 'immigrants', 'later', 'popularized', 'pizza', 'in', 'the', 'United', 'States', ';', 'making', 'it', 'world', "'", 's', 'favourite', 'dish', '.']
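
The splitting seen with `wordpunct_tokenize` can be approximated with one regular expression. To my knowledge NLTK documents this tokenizer as being built on the pattern `\w+|[^\w\s]+`; the helper name `simple_wordpunct` below is hypothetical, and the snippet is a sketch of the idea, not NLTK's implementation.

```python
import re

def simple_wordpunct(text):
    # Match runs of word characters, or runs of characters that are
    # neither word characters nor whitespace (i.e. punctuation runs).
    return re.findall(r"\w+|[^\w\s]+", text)

tokens = simple_wordpunct("it's world's favourite dish.")
print(tokens)
# ['it', "'", 's', 'world', "'", 's', 'favourite', 'dish', '.']
```

Because an apostrophe is not a word character, contractions and possessives are always cut into three tokens, which is the main visible difference from `word_tokenize`.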


## **Stemming**
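Stemming in NLP is a text preprocessing technique that reduces words to a base or root form (e.g., "running" and "runs" both become "run") by stripping affixes, mostly suffixes, with heuristic rules. Because the rules are heuristic, the resulting stem is not always a dictionary word (e.g., the Porter stemmer turns "fairly" into "fairli"). It is used for normalization in search engines, chatbots, and text mining to improve efficiency by reducing vocabulary size.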
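
The idea of suffix stripping, and its effect on vocabulary size, can be sketched in a few lines. The suffix list, the minimum stem length of three characters, and the name `toy_stem` are all assumptions chosen for illustration; real stemmers such as Porter apply many more rules.

```python
# Hypothetical suffix list, chosen only for this illustration.
SUFFIXES = ("ing", "es", "s")

def toy_stem(word):
    # Strip the first matching suffix, keeping at least three characters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

sample_words = ["walking", "walks", "walk", "eating", "eats", "eat"]
stems = [toy_stem(w) for w in sample_words]

print(stems)  # ['walk', 'walk', 'walk', 'eat', 'eat', 'eat']
print(len(set(sample_words)), "->", len(set(stems)))  # 6 -> 2
```

Six distinct surface forms collapse to two stems, which is the vocabulary reduction described above.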

### Implementation

In [11]:
words = ['running', 'runs', 'eating', 'walking', 'eaten', 'walks', 'goes', 'go', 'going', 'fairly', 'sportingly']

In [12]:
from nltk.stem import PorterStemmer

porter = PorterStemmer()

for w in words:
    print(f"{w}: {porter.stem(w)}")

running: run
runs: run
eating: eat
walking: walk
eaten: eaten
walks: walk
goes: goe
go: go
going: go
fairly: fairli
sportingly: sportingli


In [13]:
from nltk.stem import SnowballStemmer

snow = SnowballStemmer(language='english')

for w in words:
    print(f"{w}: {snow.stem(w)}")

running: run
runs: run
eating: eat
walking: walk
eaten: eaten
walks: walk
goes: goe
go: go
going: go
fairly: fair
sportingly: sport


In [15]:
from nltk.stem import RegexpStemmer

# the "$" anchor must come after each suffix ("s$", not "$s")
reg = RegexpStemmer(regexp="ing$|es$|ly$|s$|en$")

for w in words:
    print(f"{w}: {reg.stem(w)}")

running: runn
runs: run
eating: eat
walking: walk
eaten: eat
walks: walk
goes: go
go: go
going: go
fairly: fair
sportingly: sporting
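
A regex-based stemmer amounts to deleting whatever an end-anchored pattern matches. The sketch below reproduces that idea with the standard `re` module only; `regex_stem` and its `min_length` parameter are hypothetical names for illustration, not NLTK's API. Note that the `$` anchor has to follow the suffix (`s$`): written the other way round (`$s`), the alternative can never match, because nothing can come after the end of the string.

```python
import re

# End-anchored suffix alternatives; "$" comes after each suffix.
SUFFIX_RE = re.compile(r"ing$|es$|ly$|s$|en$")

def regex_stem(word, min_length=0):
    # Leave words shorter than min_length untouched,
    # otherwise delete the matched suffix (if any).
    if len(word) < min_length:
        return word
    return SUFFIX_RE.sub("", word)

results = {w: regex_stem(w) for w in ["runs", "walks", "goes", "sportingly"]}
print(results)
# {'runs': 'run', 'walks': 'walk', 'goes': 'go', 'sportingly': 'sporting'}

# A broken alternative such as "$s" silently matches nothing.
print(re.sub(r"$s", "", "runs"))  # runs
```

Only one suffix is removed per word, so "sportingly" stops at "sporting"; the rule-based stemmers apply several passes and go further.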
