# Tokenization in Natural Language Processing

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vuhung16au/nlp-learning-journey/blob/main/examples/tokenization.ipynb)

## Overview

Tokenization is the process of breaking down text into smaller units called tokens. These tokens can be words, subwords, characters, or other meaningful elements. It's one of the fundamental preprocessing steps in NLP.

## What You'll Learn

- Different types of tokenization
- Word-level tokenization
- Subword tokenization
- Sentence tokenization
- Using popular NLP libraries (NLTK, spaCy, Transformers)

## Prerequisites

Basic understanding of Python and text processing concepts.

## Setup and Installation

Let's install the required libraries for this notebook.

In [None]:
# Install required libraries
!pip install nltk spacy transformers tokenizers
!python -m spacy download en_core_web_sm

In [None]:
# Import required libraries
import nltk
import spacy
from transformers import AutoTokenizer
from tokenizers import Tokenizer
import re

# Download NLTK data
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases load the Punkt data from this package
nltk.download('wordnet')

# Load spaCy model
nlp = spacy.load('en_core_web_sm')

## Sample Text

Let's use a sample text to demonstrate different tokenization techniques.

In [None]:
sample_text = """
Natural Language Processing (NLP) is a fascinating field! It combines linguistics, 
computer science, and machine learning. Today's NLP models can understand context, 
generate text, and even translate languages. Isn't that amazing? Let's explore 
tokenization techniques. Visit www.example.com for more info!
"""

print("Sample Text:")
print(sample_text)

## 1. Basic Tokenization

### Simple Split-based Tokenization

The simplest form of tokenization is splitting text by whitespace.

In [None]:
# Basic whitespace tokenization
basic_tokens = sample_text.split()

print("Basic Split Tokenization:")
print(f"Number of tokens: {len(basic_tokens)}")
print("First 10 tokens:", basic_tokens[:10])
print("\nLimitations: Notice punctuation is attached to words")
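
To make the limitation above concrete, here is a minimal sketch using only the standard library. It contrasts `split()` with `split(' ')` and then strips punctuation from the edges of each token. The example string and the variable names are illustrative and are not part of the notebook's sample text.

```python
import string

example = "Hello,  world!\nIsn't (NLP) fun?"

# split() with no argument splits on any run of whitespace, including newlines
whitespace_tokens = example.split()

# split(' ') splits on single spaces only, so it yields an empty string
# for the double space and leaves the newline inside a token
single_space_tokens = example.split(' ')

# Stripping punctuation from both ends keeps the inner apostrophe in "Isn't"
stripped = [tok.strip(string.punctuation) for tok in whitespace_tokens]

print(whitespace_tokens)
print(single_space_tokens)
print(stripped)
```

Edge stripping removes the attached punctuation but also discards it, which is a problem whenever punctuation carries meaning for the downstream task.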

### Regular Expression Tokenization

We can use regular expressions for more sophisticated tokenization.

In [None]:
# Regular expression tokenization
import re

# Pattern to match runs of word characters (letters, digits, and underscore)
pattern = r'\b\w+\b'
regex_tokens = re.findall(pattern, sample_text)

print("Regular Expression Tokenization:")
print(f"Number of tokens: {len(regex_tokens)}")
print("First 10 tokens:", regex_tokens[:10])
print("\nNote: Punctuation is dropped entirely, and contractions are split at the apostrophe")
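
The `\b\w+\b` pattern returns only runs of word characters, so punctuation marks never appear in its output. If punctuation should be kept as separate tokens, one option is to add a second alternative to the pattern. The sketch below assumes a hypothetical helper named `regex_tokenize`; it is one possible pattern, not the only reasonable one.

```python
import re

def regex_tokenize(text):
    """Return words and punctuation marks as separate tokens."""
    # \w+     matches a run of word characters
    # [^\w\s] matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

words_only = re.findall(r"\b\w+\b", "Isn't that amazing?")
with_punct = regex_tokenize("Isn't that amazing?")

print(words_only)
print(with_punct)
```

Both patterns still break "Isn't" at the apostrophe, which is one reason to reach for a library tokenizer that knows about contractions.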

## 2. NLTK Tokenization

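NLTK's `word_tokenize` applies Penn Treebank-style rules: it separates punctuation from words and splits contractions (for example, "Isn't" becomes "Is" and "n't"). `sent_tokenize` uses the pre-trained Punkt model to find sentence boundaries.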

In [None]:
# NLTK Word Tokenization
from nltk.tokenize import word_tokenize, sent_tokenize

nltk_word_tokens = word_tokenize(sample_text)

print("NLTK Word Tokenization:")
print(f"Number of tokens: {len(nltk_word_tokens)}")
print("All tokens:", nltk_word_tokens)
print("\nAdvantages: Handles punctuation, contractions, and special cases better")

In [None]:
# NLTK Sentence Tokenization
sentences = sent_tokenize(sample_text)

print("NLTK Sentence Tokenization:")
print(f"Number of sentences: {len(sentences)}")
for i, sent in enumerate(sentences, 1):
    print(f"Sentence {i}: {sent.strip()}")
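
To see why sentence splitting benefits from a trained model such as Punkt, it helps to try the obvious rule first. The sketch below uses a hypothetical helper, `naive_sent_split`, that splits after `.`, `!`, or `?` whenever whitespace follows. It works on simple text and fails on abbreviations.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    # The lookbehind keeps the punctuation mark attached to its sentence
    return re.split(r"(?<=[.!?])\s+", text.strip())

simple = naive_sent_split("It works! Does it? Yes.")
abbrev = naive_sent_split("Dr. Smith arrived. He sat down.")

print(simple)
print(abbrev)
```

The second call breaks the text after "Dr.", which is the kind of case Punkt is designed to handle by learning which tokens are abbreviations.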

## 3. spaCy Tokenization

spaCy tokenizes text with language-specific rules and exceptions, and its pipeline then annotates each token with linguistic attributes such as lemma and part-of-speech tag.

In [None]:
# spaCy tokenization
doc = nlp(sample_text)

print("spaCy Tokenization:")
print(f"Number of tokens: {len(doc)}")

# Display tokens with additional information
print("\nTokens with linguistic information:")
for token in doc[:15]:  # First 15 tokens
    print(f"Token: '{token.text}' | Lemma: '{token.lemma_}' | POS: {token.pos_} | Is_alpha: {token.is_alpha}")

In [None]:
# spaCy sentence segmentation
print("spaCy Sentence Segmentation:")
for i, sent in enumerate(doc.sents, 1):
    print(f"Sentence {i}: {sent.text.strip()}")
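
spaCy tokens also remember where they came from: `token.idx` gives the character offset and `token.text_with_ws` includes any trailing whitespace, so the original string can be rebuilt from the tokens. The idea can be sketched without spaCy. The helper `tokenize_with_offsets` below is hypothetical and uses a crude regex rather than spaCy's actual tokenization rules.

```python
import re

def tokenize_with_offsets(text):
    """Return (token, start_offset, trailing_whitespace) triples."""
    triples = []
    # Each match is one token plus whatever whitespace follows it
    for match in re.finditer(r"(\w+|[^\w\s])(\s*)", text):
        triples.append((match.group(1), match.start(1), match.group(2)))
    return triples

text = "Let's explore  tokenization!"
triples = tokenize_with_offsets(text)

# Joining each token with its trailing whitespace recovers the input exactly
rebuilt = "".join(tok + ws for tok, _, ws in triples)

for tok, start, ws in triples:
    print(f"{start:>2}  {tok!r}  trailing={ws!r}")
print(rebuilt == text)
```

Keeping offsets makes it possible to map annotations (entities, highlights) back onto the raw text, which a plain list of strings cannot do.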

## 4. Subword Tokenization

Subword tokenization is crucial for handling out-of-vocabulary words and is widely used in modern NLP models.

### BERT Tokenization (WordPiece)

BERT uses WordPiece tokenization, which breaks words into subword units. Pieces that continue a word are prefixed with `##`.

In [None]:
# BERT tokenizer (WordPiece)
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Tokenize the text
bert_tokens = bert_tokenizer.tokenize(sample_text)
# encode() also adds the special [CLS] and [SEP] tokens, so it returns two more IDs
bert_token_ids = bert_tokenizer.encode(sample_text)

print("BERT WordPiece Tokenization:")
print(f"Number of tokens: {len(bert_tokens)}")
print("\nFirst 20 tokens:")
for token in bert_tokens[:20]:
    print(f"'{token}'")

print("\nSpecial tokens:")
print(f"[CLS] token ID: {bert_tokenizer.cls_token_id}")
print(f"[SEP] token ID: {bert_tokenizer.sep_token_id}")
print(f"[PAD] token ID: {bert_tokenizer.pad_token_id}")
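
At tokenization time, WordPiece splits each word by greedy longest-match-first lookup against a fixed vocabulary. The sketch below illustrates that lookup with a tiny hand-made vocabulary; `wordpiece_tokenize` and `toy_vocab` are hypothetical names, and the real `bert-base-uncased` vocabulary has roughly 30,000 entries. It is a simplified illustration, not the library's implementation.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into WordPiece-style pieces by greedy longest match."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary, invented for this example
toy_vocab = {"token", "##ization", "##ize", "un", "##believ", "##able"}

print(wordpiece_tokenize("tokenization", toy_vocab))
print(wordpiece_tokenize("unbelievable", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

Because the match is greedy, the result depends entirely on which pieces the vocabulary contains, which is why a model must always be paired with the tokenizer it was trained with.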

### GPT Tokenization (Byte Pair Encoding)

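GPT models use Byte Pair Encoding (BPE) for tokenization. GPT-2, loaded below, uses a byte-level variant that operates on bytes rather than Unicode characters.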

In [None]:
# GPT tokenizer (BPE)
gpt_tokenizer = AutoTokenizer.from_pretrained('gpt2')

# Set pad token (GPT-2 doesn't have one by default)
gpt_tokenizer.pad_token = gpt_tokenizer.eos_token

# Tokenize the text
gpt_tokens = gpt_tokenizer.tokenize(sample_text)
gpt_token_ids = gpt_tokenizer.encode(sample_text)

print("GPT BPE Tokenization:")
print(f"Number of tokens: {len(gpt_tokens)}")
print("\nFirst 20 tokens:")
for token in gpt_tokens[:20]:
    print(f"'{token}'")

print("\nNote: Ġ marks a token that was preceded by a space in the original text")
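
The core of BPE training is a simple loop: count adjacent symbol pairs, merge the most frequent pair, and repeat. The sketch below follows the character-level version described in the subword-units paper listed under Resources, using a toy word-frequency table. The function names and the `</w>` end-of-word marker are choices made for this example; GPT-2's byte-level tokenizer works on bytes and marks spaces with `Ġ` instead.

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in corpus.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(pair, corpus):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def learn_bpe(word_freqs, num_merges):
    """Learn `num_merges` merge rules from a word -> frequency mapping."""
    # Start from single characters plus an end-of-word marker
    corpus = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(corpus)
        if not counts:
            break
        best = max(counts, key=counts.get)  # ties go to the pair seen first
        merges.append(best)
        corpus = merge_pair(best, corpus)
    return merges, corpus

# Toy word frequencies for illustration
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, corpus = learn_bpe(word_freqs, num_merges=3)

print(merges)
print(list(corpus))
```

After three merges the frequent suffix "est" plus the end-of-word marker has become a single symbol, so both "newest" and "widest" share it while rarer character sequences stay split.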

## 5. Tokenization Comparison

Let's compare the different tokenization methods on a complex example.

In [None]:
complex_text = "The pre-trained transformer-based models can handle out-of-vocabulary words like 'supercalifragilisticexpialidocious' effectively!"

print("Complex Text:", complex_text)
print("\n" + "="*50)

# Basic split
basic = complex_text.split()
print(f"\nBasic Split ({len(basic)} tokens):")
print(basic)

# NLTK
nltk_tokens = word_tokenize(complex_text)
print(f"\nNLTK ({len(nltk_tokens)} tokens):")
print(nltk_tokens)

# spaCy
spacy_doc = nlp(complex_text)
spacy_tokens = [token.text for token in spacy_doc]
print(f"\nspaCy ({len(spacy_tokens)} tokens):")
print(spacy_tokens)

# BERT
bert_tokens_complex = bert_tokenizer.tokenize(complex_text)
print(f"\nBERT WordPiece ({len(bert_tokens_complex)} tokens):")
print(bert_tokens_complex)

# GPT
gpt_tokens_complex = gpt_tokenizer.tokenize(complex_text)
print(f"\nGPT BPE ({len(gpt_tokens_complex)} tokens):")
print(gpt_tokens_complex)

## 6. Practical Considerations

### Handling Special Cases

In [None]:
special_text = """
Email: user@example.com
URL: https://www.example.com/path?param=value
Phone: +1-555-123-4567
Hashtag: #NLP #MachineLearning
Mention: @username
Contraction: don't, won't, can't
Numbers: 3.14, $100.50, 1,000,000
Unicode: 😀 🌟 ñáéíóú
"""

print("Special Cases Text:")
print(special_text)

print("\n" + "="*50)
print("\nNLTK Tokenization:")
nltk_special = word_tokenize(special_text)
print(nltk_special)

print("\nspaCy Tokenization:")
spacy_special = [token.text for token in nlp(special_text)]
print(spacy_special)

### Custom Tokenization Rules

In [None]:
def custom_tokenizer(text):
    """
    Custom tokenizer that preserves emails, URLs, hashtags, mentions,
    and phone numbers as single tokens, then uses NLTK for the rest.
    """
    import re
    
    # Define patterns. Order matters: emails are extracted before mentions
    # so that '@example' in 'user@example.com' is not also reported as a mention.
    patterns = {
        'email': r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
        'url': r'https?://[^\s]+',
        'hashtag': r'#\w+',
        'mention': r'@\w+',
        'phone': r'\+?1?-?\(?\d{3}\)?-?\d{3}-?\d{4}',
    }
    
    tokens = []
    remaining_text = text
    
    # Extract each special pattern, then blank it out before trying the next one
    for pattern_name, pattern in patterns.items():
        for match in re.finditer(pattern, remaining_text):
            tokens.append((match.group(), pattern_name))
        remaining_text = re.sub(pattern, ' ', remaining_text)
    
    # Use NLTK for the rest
    regular_tokens = [(token, 'word') for token in word_tokenize(remaining_text) if token.strip()]
    
    # Special tokens are listed first, so the original token order is not preserved
    return tokens + regular_tokens

# Test custom tokenizer
custom_tokens = custom_tokenizer(special_text)
print("Custom Tokenization with Type Labels:")
for token, token_type in custom_tokens[:20]:  # First 20 tokens
    print(f"'{token}' ({token_type})")
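
The function above returns the special tokens first and the ordinary words afterwards, so the original order of the text is lost. If order matters, one alternative is a single combined regex with named groups, scanned left to right. The sketch below is a standard-library-only illustration; `TOKEN_SPEC`, `ordered_tokenizer`, and the simple `word` and `punct` patterns are assumptions made for this example and are much cruder than NLTK's rules.

```python
import re

# Earlier entries win when several patterns could match at the same position
TOKEN_SPEC = [
    ("email", r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    ("url", r"https?://\S+"),
    ("hashtag", r"#\w+"),
    ("mention", r"@\w+"),
    ("word", r"\w+(?:'\w+)?"),   # a word, optionally with one apostrophe part
    ("punct", r"[^\w\s]"),       # any other single non-space character
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def ordered_tokenizer(text):
    """Return (token, type) pairs in the order they appear in the text."""
    # lastgroup is the name of the alternative that produced the match
    return [(m.group(), m.lastgroup) for m in TOKEN_RE.finditer(text)]

result = ordered_tokenizer("Ping @username at user@example.com about #NLP, don't wait!")
for token, token_type in result:
    print(f"'{token}' ({token_type})")
```

Putting all patterns into one alternation means each character is consumed by exactly one rule, which avoids the overlap between the email and mention patterns without a separate blanking step.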

## 7. Exercises

Try these exercises to practice tokenization:

### Exercise 1: Multilingual Tokenization

Test different tokenizers on multilingual text.

In [None]:
multilingual_text = """
English: Hello, how are you?
Spanish: Hola, ¿cómo estás?
French: Bonjour, comment allez-vous?
German: Hallo, wie geht es dir?
Chinese: 你好，你好吗？
Japanese: こんにちは、元気ですか？
"""

print("Multilingual Text Tokenization:")
print("Original text:")
print(multilingual_text)

# TODO: Try different tokenizers on this multilingual text
# Compare NLTK, spaCy, and transformer tokenizers

# Your code here:


### Exercise 2: Token Statistics

Analyze tokenization statistics for a longer text.

In [None]:
# TODO: Implement a function that takes a text and returns:
# - Total number of tokens
# - Average token length
# - Most common tokens
# - Vocabulary size (unique tokens)

def analyze_tokenization(text, tokenizer_func):
    """
    Analyze tokenization statistics
    """
    # Your implementation here
    pass

# Test with a longer text (you can use any text you like)
long_text = """
Natural language processing (NLP) is a subfield of linguistics, computer science, 
and artificial intelligence concerned with the interactions between computers and 
human language, in particular how to program computers to process and analyze 
large amounts of natural language data. The goal is a computer capable of 
understanding the contents of documents, including the contextual nuances of 
the language within them.
"""

# Your analysis code here:


## Key Takeaways

1. **Different tokenizers for different purposes**:
   - Basic split: Fast but limited
   - NLTK: Good for traditional NLP tasks
   - spaCy: Industrial-strength with linguistic features
   - Transformer tokenizers: Essential for modern NLP models

2. **Subword tokenization advantages**:
   - Handles out-of-vocabulary words
   - Keeps the vocabulary at a fixed, manageable size
   - Better for morphologically rich languages

3. **Consider your use case**:
   - Text preprocessing: NLTK or spaCy
   - Modern NLP models: Use the tokenizer the model was trained with
   - Custom applications: May need custom rules

4. **Important considerations**:
   - Language support
   - Special character handling
   - Performance requirements
   - Downstream task compatibility

## Next Steps

- Explore text normalization techniques
- Learn about stemming and lemmatization
- Study different subword algorithms (BPE, WordPiece, SentencePiece)
- Practice with domain-specific texts (social media, biomedical, legal)

## Resources

- [NLTK Documentation](https://www.nltk.org/)
- [spaCy Documentation](https://spacy.io/)
- [Hugging Face Tokenizers](https://huggingface.co/docs/tokenizers/)
- [Neural Machine Translation of Rare Words with Subword Units](https://arxiv.org/abs/1508.07909)
