<a href="https://colab.research.google.com/github/sahanyafernando/Axiora/blob/main/Subword_Tokenization_in_Real_World_Scenarios.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Subword Tokenization in Real-World Scenarios

***Objective***: Understand how different subword tokenizers work and why they matter in real-world NLP tasks.

***Scenario***: Suppose we are building a multilingual chatbot, and some users type rare or newly coined words such as "ecoengineer" or "hyperautomation".

***Question***: How do we ensure our model can handle words it hasn't seen during training?

### Problem 1: Rare words confuse the model

In [None]:
from transformers import AutoTokenizer

# Let's simulate a sentence from a user
sentence = "Our ecoengineer developed a hyperautomation pipeline"

# Load the pretrained tokenizer that ships with BERT
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize with WordPiece (used in BERT)
tokens = tokenizer.tokenize(sentence)
print("WordPiece Tokens:", tokens)

# This gives the output below:
# WordPiece Tokens: ['our', 'eco', '##eng', '##ine', '##er', 'developed', 'a', 'hyper', '##au', '##tom', '##ation', 'pipeline']
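
To see where pieces like `##eng` come from, here is a minimal sketch of the greedy longest-match-first lookup that WordPiece applies to each word at tokenization time. The function name `wordpiece_tokenize` and the tiny `toy_vocab` are assumptions made for illustration; the real BERT vocabulary has roughly 30,000 entries, and the library implementation handles more edge cases.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into pieces using greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, chosen only to mirror the pieces shown above
toy_vocab = {"eco", "##eng", "##ine", "##er", "hyper", "##au", "##tom", "##ation"}

print(wordpiece_tokenize("ecoengineer", toy_vocab))
print(wordpiece_tokenize("hyperautomation", toy_vocab))
```

With this toy vocabulary the two rare words split into the same pieces as in the BERT output, while a word with no matching pieces at all falls back to `[UNK]`.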

### Problem 2: Languages Without Spaces (like Chinese, Japanese)

Imagine a user types in Japanese, or enters a compound Hindi word, with no clear word boundaries.

In [2]:
import sentencepiece as spm

# Train model on toy data
with open("data.txt", "w") as f:
    f.write("Our ecoengineer hyperautomation AI automation intelligence\n")

spm.SentencePieceTrainer.Train(input='data.txt', model_prefix='bpe_demo', vocab_size=40, model_type='bpe')

# Load and tokenize
sp = spm.SentencePieceProcessor(model_file='bpe_demo.model')
print(sp.encode('ecoengineer hyperautomation', out_type=str))

['▁', 'e', 'co', 'en', 'g', 'in', 'e', 'er', '▁', 'hy', 'p', 'er', 'automation']


SentencePiece can handle raw text without whitespace splitting: it treats the input as one character stream and encodes each space as the "▁" symbol. This is very useful for multilingual models or when users enter creative spellings.
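
The role of the "▁" marker can be sketched in plain Python. This is a simplified illustration, not the library's implementation: the helper names `to_raw_stream` and `detokenize` are made up for this example, and the sketch assumes the default behaviour of adding a leading "▁" to the input.

```python
WORD_BOUNDARY = "\u2581"  # the "▁" symbol used to encode spaces


def to_raw_stream(text):
    """Turn text into a single stream in which spaces become visible symbols."""
    return WORD_BOUNDARY + text.replace(" ", WORD_BOUNDARY)


def detokenize(pieces):
    """Rebuild the original text by joining pieces and restoring the spaces."""
    text = "".join(pieces).replace(WORD_BOUNDARY, " ")
    return text[1:] if text.startswith(" ") else text


# The pieces printed by the toy model above
pieces = [WORD_BOUNDARY, "e", "co", "en", "g", "in", "e", "er",
          WORD_BOUNDARY, "hy", "p", "er", "automation"]

print(to_raw_stream("ecoengineer hyperautomation"))
print(detokenize(pieces))
```

Because the space is just another symbol in the stream, the original text can be recovered by simple concatenation, and text without any spaces (such as Japanese) goes through the same path unchanged.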

### Problem 3: Training a Tokenizer for Your Domain

Suppose you are working in the healthcare domain and users type complex medical terms. You want to train a custom tokenizer that understands medical jargon.

Let's train a tiny BPE tokenizer on a small domain-specific corpus.

In [4]:
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

# Simulated medical terms
texts = [
    "neuroplasticity", "neurogenisis", "immunotherapy",
    "cardiomyopathy", "pharmacogenomics", "biocompatibility"
]

tokenizer_bpe = Tokenizer(models.BPE())
tokenizer_bpe.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=40, show_progress=False)

tokenizer_bpe.train_from_iterator(texts, trainer)

output = tokenizer_bpe.encode("biocompatibility and neuroplasticity")
print("BPE Tokens", output.tokens)


BPE Tokens ['bi', 'o', 'com', 'p', 'at', 'i', 'bi', 'l', 'ity', 'a', 'n', 'd', 'neur', 'op', 'l', 'as', 't', 'ic', 'ity']


BPE learns the most frequent subword units from your domain data, which is especially helpful when public tokenizers miss important terms.
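
The merge-learning step behind this can be sketched in a few lines of plain Python. This is a minimal illustration of the classic BPE training loop, not the `tokenizers` library's implementation, which differs in details such as tie-breaking and frequency thresholds. The function name `learn_bpe_merges` is an assumption made for this example.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to num_merges merge rules from a list of words."""
    # Start with every word split into single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Replace every occurrence of the best pair with one merged symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges


medical_terms = [
    "neuroplasticity", "immunotherapy", "cardiomyopathy",
    "pharmacogenomics", "biocompatibility",
]
print(learn_bpe_merges(medical_terms, num_merges=5))
```

Each learned merge becomes a new vocabulary entry, so pairs that are frequent in the domain corpus end up as single tokens, while words absent from the corpus (like "and" in the output above) stay split into characters.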

### What About N-Grams?

In [6]:
def generate_ngrams(text, n=2):
  words = text.split()
  return [words[i:i+n] for i in range(len(words)-n+1)]

generate_ngrams("Language models learn patterns", n=3)

[['Language', 'models', 'learn'], ['models', 'learn', 'patterns']]

N-grams are fixed-size chunks of words, often used in traditional models. Unlike subwords, they do not adapt to the internal structure of unknown words.
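
A small sketch makes the limitation concrete. It assumes a hypothetical model that has only ever seen one training sentence, and checks which bigrams of a new sentence it would recognise. The helper `word_ngrams` and both sentences are made up for illustration.

```python
def word_ngrams(text, n=2):
    """Return the word n-grams of a sentence as tuples."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


# Bigrams the hypothetical model saw during training
seen = set(word_ngrams("Our engineer developed a pipeline"))

# A new sentence that contains a newly coined word
new_bigrams = word_ngrams("Our ecoengineer developed a pipeline")
unseen = [gram for gram in new_bigrams if gram not in seen]

print(unseen)
```

Every bigram that touches "ecoengineer" is unseen, even though the word contains the familiar "engineer". A subword tokenizer would still recover pieces of it.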

### What Have We Covered?

- Subword tokenization is essential for rare or new words.
- WordPiece (BERT): breaks words into a known base piece plus continuation pieces marked with ##.
- SentencePiece: great for multilingual and raw text handling.
- BPE: builds its vocabulary from the most frequent character merges, great for domain adaptation.
- N-Grams: traditional but less flexible for rare or new terms.
