# Tokenization Warmup


In [5]:
# Install required libraries
!pip install nltk scikit-learn transformers torch torchtext

Collecting en-core-web-sm==3.8.0
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [1]:


# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import nltk
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import torch
from transformers import BertTokenizer, BertModel, AutoTokenizer

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /home/raphael/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/raphael/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

We'll use the following example texts (corpus) throughout this notebook to showcase different text representation methods:

In [2]:
corpus = [
    # Technical/Academic
    "Natural language processing transforms text into numbers.",
    "Deep learning models understand context in language.",
    "Word embeddings capture semantic relationships between words.",
    "The company made $5.3 million in 2023 (a 10% increase).",
    "Transformer architectures revolutionized machine translation tasks.",
    "BERT models can be fine-tuned for specific downstream tasks.",
    "Attention mechanisms help models focus on relevant parts of input sequences.",
    "Token classification involves labeling individual words in a sentence.",
    "Semantic similarity measures how close two texts are in meaning.",
    "Language models predict the probability of word sequences.",
    
    # Business/Financial
    "The quarterly report indicated a 12.7% growth in emerging markets.",
    "Investors remained cautious despite promising economic indicators.",
    "The startup secured $8.5 million in Series A funding last month.",
    "Market volatility increased following the central bank's announcement.",
    "The merger is expected to be finalized by Q3, pending regulatory approval.",
    "Consumer confidence indices fell by 3.2 points in December.",
    "The company's stock price dropped 15% after the earnings call.",
    "Annual revenue exceeded projections by approximately $2.3 million.",
    "Operational costs were reduced by implementing new software solutions.",
    "The board unanimously approved the five-year strategic plan.",
    
    # Conversational/Informal
    "I can't believe she said that to her boss yesterday!",
    "Have you tried that new restaurant on Main Street yet?",
    "The movie was okay, but the book was much better.",
    "Could you pick up some groceries on your way home?",
    "I'm thinking about getting a dog, what do you think?",
    "That concert was absolutely amazing, best one this year!",
    "My flight got delayed, so I'll be arriving three hours late.",
    "We should definitely hang out this weekend if you're free.",
    "The weather has been unusually warm for February.",
    "Did you see the game last night? What an incredible comeback!",
    
    # Health/Medical
    "Patients showed a 40% reduction in symptoms after the treatment.",
    "Regular exercise may decrease the risk of cardiovascular disease.",
    "The clinical trial included 2,500 participants across 12 countries.",
    "The drug was approved for use after demonstrating efficacy in Phase III trials.",
    "Researchers identified a novel biomarker associated with early-stage cancer.",
    "Telemedicine appointments increased by 287% during the pandemic.",
    "The study found a statistically significant correlation (p<0.01) between diet and inflammatory markers.",
    "Vaccination rates vary considerably between urban and rural communities.",
    "Patient-reported outcomes were measured using standardized questionnaires.",
    "The surgical procedure has a recovery period of approximately 4-6 weeks.",
    
    # News Headlines/Sentences
    "Global leaders gather for climate summit amid rising tensions.",
    "Tech giant unveils revolutionary AI system at annual conference.",
    "Historic peace agreement signed after decades of conflict.",
    "Scientists discover potential breakthrough in renewable energy storage.",
    "Major transportation strike disrupts commuters for third consecutive day.",
    "Supreme Court issues landmark ruling on digital privacy rights.",
    "Tropical storm causes extensive damage along coastal regions.",
    "Olympic athlete breaks world record in spectacular final performance.",
    "New legislation aims to address growing housing affordability crisis.",
    "International space mission successfully completes first phase of exploration.",
    
    # Questions/Queries
    "What are the main differences between supervised and unsupervised learning?",
    "How does cloud computing impact business scalability?",
    "When was the Declaration of Independence signed?",
    "Why do leaves change color in autumn?",
    "Where can I find reliable information about renewable energy technologies?",
    "Who is considered the founder of modern computer science?",
    "What causes earthquakes and how are they measured?",
    "How many calories are in an average apple?",
    "What's the fastest way to get from New York to Boston?",
    "What should I consider when buying my first home?",
    
    # Complex Sentences
    "Despite initial skepticism from industry experts, the revolutionary approach proved successful in addressing long-standing challenges.",
    "The committee, having reviewed all submitted proposals, recommended proceeding with the most cost-effective option while maintaining quality standards.",
    "Although the experiment yielded unexpected results, researchers identified several promising avenues for future investigation that could potentially transform the field.",
    "The novel, which interweaves historical events with fictional narratives, offers a nuanced perspective on the socio-political landscape of the era.",
    "While acknowledging the limitations of current technology, engineers remain optimistic about overcoming these obstacles through collaborative innovation.",
    "The report highlights that, contrary to popular belief, implementing sustainable practices often leads to improved long-term profitability.",
    "When confronted with conflicting evidence, the team opted to conduct additional tests before drawing any definitive conclusions.",
    "As urbanization continues to accelerate globally, cities face unprecedented challenges in managing resources, infrastructure, and social equity.",
    "The documentary explores how, throughout human history, technological advancements have simultaneously solved existing problems and created new ones.",
    "Considering the multifaceted nature of the issue, policymakers advocate for an integrated approach that addresses both immediate concerns and underlying systemic factors."
]

## 1. Tokenization

Tokenization is the process of breaking text into smaller units (tokens) that serve as the basic elements for numerical representation.

### 1.1 Word-Level Tokenization

Word tokenization splits text at word boundaries, typically using spaces and punctuation as delimiters.
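
To make the "spaces and punctuation as delimiters" idea concrete, here is a minimal regex-based sketch. The function name `regex_tokenize` and the pattern are illustrative assumptions, not part of NLTK or of this notebook's later cells.

```python
import re

def regex_tokenize(text):
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character
    # nor whitespace (punctuation and symbols).
    return re.findall(r"\w+|[^\w\s]", text)

example = "It costs $5.3 (a 10% increase)."
print(regex_tokenize(example))
# ['It', 'costs', '$', '5', '.', '3', '(', 'a', '10', '%', 'increase', ')', '.']
```

Note that this naive pattern breaks `5.3` into three tokens, whereas NLTK's `word_tokenize` keeps the decimal number together, which is one reason a dedicated tokenizer is usually preferable.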

In [3]:
for i, focus_text in enumerate(corpus[:5]):
    # We'll focus our word-level examples on this sentence
    print(focus_text)
    # Simple space-based tokenization
    basic_tokens = focus_text.split()
    print("Basic tokens:", basic_tokens)
    
    # NLTK word tokenization (handles punctuation better)
    nltk_tokens = word_tokenize(focus_text)
    print("NLTK tokens:", nltk_tokens)

Natural language processing transforms text into numbers.
Basic tokens: ['Natural', 'language', 'processing', 'transforms', 'text', 'into', 'numbers.']
NLTK tokens: ['Natural', 'language', 'processing', 'transforms', 'text', 'into', 'numbers', '.']
Deep learning models understand context in language.
Basic tokens: ['Deep', 'learning', 'models', 'understand', 'context', 'in', 'language.']
NLTK tokens: ['Deep', 'learning', 'models', 'understand', 'context', 'in', 'language', '.']
Word embeddings capture semantic relationships between words.
Basic tokens: ['Word', 'embeddings', 'capture', 'semantic', 'relationships', 'between', 'words.']
NLTK tokens: ['Word', 'embeddings', 'capture', 'semantic', 'relationships', 'between', 'words', '.']
The company made $5.3 million in 2023 (a 10% increase).
Basic tokens: ['The', 'company', 'made', '$5.3', 'million', 'in', '2023', '(a', '10%', 'increase).']
NLTK tokens: ['The', 'company', 'made', '$', '5.3', 'million', 'in', '2023', '(', 'a', '10', '%', '

In [4]:
# Basic space-based tokenization methods
def basic_tokenize_to_ids(corpus):
    # Create vocabulary
    vocab = set()
    for text in corpus:
        tokens = text.split()
        vocab.update(tokens)
    
    # Create word-to-id and id-to-word mappings
    word_to_id = {word: idx for idx, word in enumerate(sorted(vocab))}
    id_to_word = {idx: word for word, idx in word_to_id.items()}
    
    # Convert texts to token ids
    tokenized_corpus = []
    for text in corpus:
        tokens = text.split()
        token_ids = [word_to_id[token] for token in tokens]
        tokenized_corpus.append(token_ids)
    
    return tokenized_corpus, word_to_id, id_to_word

def basic_ids_to_text(token_ids, id_to_word):
    tokens = [id_to_word[idx] for idx in token_ids]
    return ' '.join(tokens)
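
One limitation of the mapping above is that `word_to_id[token]` raises a `KeyError` for any word that was not in the corpus used to build the vocabulary. A common workaround is to reserve an ID for unknown tokens. The sketch below assumes a hypothetical `<unk>` token and two helper names (`build_vocab_with_unk`, `encode_with_unk`) that do not appear elsewhere in this notebook.

```python
def build_vocab_with_unk(corpus, unk_token="<unk>"):
    # Reserve ID 0 for unknown tokens, then number the sorted vocabulary.
    word_to_id = {unk_token: 0}
    vocab = sorted({token for text in corpus for token in text.split()})
    for word in vocab:
        if word != unk_token:  # avoid overwriting the reserved entry
            word_to_id[word] = len(word_to_id)
    return word_to_id

def encode_with_unk(text, word_to_id, unk_token="<unk>"):
    # Fall back to the unknown-token ID instead of raising KeyError.
    unk_id = word_to_id[unk_token]
    return [word_to_id.get(token, unk_id) for token in text.split()]

toy_corpus = ["deep learning models", "language models"]
toy_word_to_id = build_vocab_with_unk(toy_corpus)
print(encode_with_unk("deep language models", toy_word_to_id))     # [1, 2, 4]
print(encode_with_unk("shallow language models", toy_word_to_id))  # [0, 2, 4]
```

The price of this fallback is that decoding is no longer lossless: every unseen word maps back to the same `<unk>` string.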


In [5]:
print(f"Basic Token vocabulary size: {len(basic_tokenize_to_ids(corpus)[1])}")

Basic Token vocabulary size: 562


In [7]:
for i, focus_text in enumerate(corpus[:5]):
    print(focus_text)
    basic_tokenized, basic_word_to_id, basic_id_to_word = basic_tokenize_to_ids([focus_text])
    print("Token IDs:", basic_tokenized[0])
    reconstructed_basic = basic_ids_to_text(basic_tokenized[0], basic_id_to_word)
    print("Reconstructed:", reconstructed_basic)
    print()

Natural language processing transforms text into numbers.
Token IDs: [0, 2, 4, 6, 5, 1, 3]
Reconstructed: Natural language processing transforms text into numbers.

Deep learning models understand context in language.
Token IDs: [0, 4, 5, 6, 1, 2, 3]
Reconstructed: Deep learning models understand context in language.

Word embeddings capture semantic relationships between words.
Token IDs: [0, 3, 2, 5, 4, 1, 6]
Reconstructed: Word embeddings capture semantic relationships between words.

The company made $5.3 million in 2023 (a 10% increase).
Token IDs: [4, 5, 8, 0, 9, 6, 3, 1, 2, 7]
Reconstructed: The company made $5.3 million in 2023 (a 10% increase).

Transformer architectures revolutionized machine translation tasks.
Token IDs: [0, 1, 3, 2, 5, 4]
Reconstructed: Transformer architectures revolutionized machine translation tasks.



### Questions:
Write the `nltk_word_tokenize_to_ids` and `nltk_word_ids_to_text` functions. What is the vocabulary size for this corpus when using NLTK's `word_tokenize`? How do you explain the difference from the space-based vocabulary?

### 1.2 Character-Level Tokenization

Character tokenization breaks text into individual characters, offering a very small vocabulary but requiring longer sequences.

In [8]:
for i, focus_text in enumerate(corpus[:5]):
    # We'll focus our character-level examples on this sentence
    print(focus_text)
    
    # Character tokenization
    char_tokens = list(focus_text)
    print("Character tokens (first 20):", char_tokens[:20])
    print(f"Total tokens: {len(char_tokens)}")
    
    # Create character vocabulary
    char_vocab = sorted(set(char_tokens))
    print(f"Character vocabulary (size: {len(char_vocab)}):", char_vocab)

Natural language processing transforms text into numbers.
Character tokens (first 20): ['N', 'a', 't', 'u', 'r', 'a', 'l', ' ', 'l', 'a', 'n', 'g', 'u', 'a', 'g', 'e', ' ', 'p', 'r', 'o']
Total tokens: 57
Character vocabulary (size: 20): [' ', '.', 'N', 'a', 'b', 'c', 'e', 'f', 'g', 'i', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'x']
Deep learning models understand context in language.
Character tokens (first 20): ['D', 'e', 'e', 'p', ' ', 'l', 'e', 'a', 'r', 'n', 'i', 'n', 'g', ' ', 'm', 'o', 'd', 'e', 'l', 's']
Total tokens: 52
Character vocabulary (size: 19): [' ', '.', 'D', 'a', 'c', 'd', 'e', 'g', 'i', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'x']
Word embeddings capture semantic relationships between words.
Character tokens (first 20): ['W', 'o', 'r', 'd', ' ', 'e', 'm', 'b', 'e', 'd', 'd', 'i', 'n', 'g', 's', ' ', 'c', 'a', 'p', 't']
Total tokens: 61
Character vocabulary (size: 21): [' ', '.', 'W', 'a', 'b', 'c', 'd', 'e', 'g', 'h', 'i', 'l', 'm', 'n', 'o', 'p', 'r', 's',

In [9]:
# Character-Level Tokenization methods
def char_tokenize_to_ids(corpus):
    # Create vocabulary from all characters in the corpus
    vocab = set()
    for text in corpus:
        chars = list(text)
        vocab.update(chars)
    
    # Create char-to-id and id-to-char mappings
    char_to_id = {char: idx for idx, char in enumerate(sorted(vocab))}
    id_to_char = {idx: char for char, idx in char_to_id.items()}
    
    # Convert texts to character token ids
    tokenized_corpus = []
    for text in corpus:
        chars = list(text)
        char_ids = [char_to_id[char] for char in chars]
        tokenized_corpus.append(char_ids)
    
    return tokenized_corpus, char_to_id, id_to_char

def char_ids_to_text(token_ids, id_to_char):
    chars = [id_to_char[idx] for idx in token_ids]
    return ''.join(chars)


In [10]:
print(f"Character Token vocabulary size: {len(char_tokenize_to_ids(corpus)[1])}")

Character Token vocabulary size: 68
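
The vocabulary shrinks from 562 space-separated tokens to 68 characters, but each sentence becomes a much longer sequence. The sketch below measures that cost on two sentences taken from the corpus; the helper name `average_length` is an assumption made for illustration.

```python
def average_length(texts, tokenize):
    # Mean number of tokens per text under the given tokenizer.
    return sum(len(tokenize(text)) for text in texts) / len(texts)

sample_texts = [
    "Deep learning models understand context in language.",
    "Language models predict the probability of word sequences.",
]

word_avg_len = average_length(sample_texts, str.split)  # whitespace tokens
char_avg_len = average_length(sample_texts, list)       # one token per character

print(f"Average length (words): {word_avg_len}")
print(f"Average length (chars): {char_avg_len}")
print(f"Length ratio chars/words: {char_avg_len / word_avg_len:.1f}")
```

Passing `corpus` instead of `sample_texts` gives the same comparison over the whole dataset.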


In [11]:

# Example usage for character-level tokenization
for i, focus_text in enumerate(corpus[:5]):
    # We'll focus our character-level examples on this sentence
    print(f"\nExample {i+1}: {focus_text}")
    
    # Character tokenization
    char_tokens = list(focus_text)
    print("Character tokens (first 20):", char_tokens[:20])
    print(f"Total tokens: {len(char_tokens)}")
    
    # Create character vocabulary
    char_vocab = sorted(set(char_tokens))
    print(f"Character vocabulary (size: {len(char_vocab)}):", char_vocab)
    
    # Convert to numerical tokens and back
    print("\nCharacter tokenization method:")
    char_tokenized, char_to_id, id_to_char = char_tokenize_to_ids([focus_text])
    print("Token IDs (first 20):", char_tokenized[0][:20])
    reconstructed_char = char_ids_to_text(char_tokenized[0], id_to_char)
    print("Reconstructed:", reconstructed_char)
    print("Original matches reconstructed:", focus_text == reconstructed_char)


Example 1: Natural language processing transforms text into numbers.
Character tokens (first 20): ['N', 'a', 't', 'u', 'r', 'a', 'l', ' ', 'l', 'a', 'n', 'g', 'u', 'a', 'g', 'e', ' ', 'p', 'r', 'o']
Total tokens: 57
Character vocabulary (size: 20): [' ', '.', 'N', 'a', 'b', 'c', 'e', 'f', 'g', 'i', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'x']

Character tokenization method:
Token IDs (first 20): [2, 3, 17, 18, 15, 3, 10, 0, 10, 3, 12, 8, 18, 3, 8, 6, 0, 14, 15, 13]
Reconstructed: Natural language processing transforms text into numbers.
Original matches reconstructed: True

Example 2: Deep learning models understand context in language.
Character tokens (first 20): ['D', 'e', 'e', 'p', ' ', 'l', 'e', 'a', 'r', 'n', 'i', 'n', 'g', ' ', 'm', 'o', 'd', 'e', 'l', 's']
Total tokens: 52
Character vocabulary (size: 19): [' ', '.', 'D', 'a', 'c', 'd', 'e', 'g', 'i', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'x']

Character tokenization method:
Token IDs (first 20): [2, 6, 6, 13, 0, 9,

### 1.3 Subword Tokenization

Subword tokenization methods break words into meaningful subword units, balancing vocabulary size against semantic granularity.
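
To give an intuition for how a word is split into subword units, here is a toy sketch of greedy longest-match-first segmentation in the style of WordPiece. The vocabulary `toy_vocab` and the function `wordpiece_tokenize` are hypothetical; real tokenizers such as BERT's learn their vocabulary from data and apply extra normalization steps not shown here.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    # Greedily take the longest vocabulary entry that matches at the
    # current position; pieces after the first carry a "##" prefix.
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No entry matches here: the whole word is unknown.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary, for illustration only
toy_vocab = {"token", "##ization", "##ize", "##s", "embed", "##ding"}

print(wordpiece_tokenize("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece_tokenize("embeddings", toy_vocab))    # ['embed', '##ding', '##s']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```

With only six entries, the toy vocabulary still covers words it has never seen as a whole, which is the trade-off subword methods aim for.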

In [18]:
from transformers import AutoTokenizer, BertTokenizer

# Vocabulary size information
BERT_VOCAB_SIZE = 30522  # bert-base-uncased vocabulary size
GPT2_VOCAB_SIZE = 50257  # gpt2 vocabulary size

# Load each tokenizer once, outside the loop, to avoid reloading it per sentence
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

for i, focus_text in enumerate(corpus[:5]):
    # We'll focus our subword examples on this sentence
    print(f"\nExample {i+1}: {focus_text}")

    # BPE tokenization using GPT-2's tokenizer
    # ('Ġ' in a token marks a leading space in the original text)
    gpt2_tokens = gpt2_tokenizer.tokenize(focus_text)
    gpt2_token_ids = gpt2_tokenizer.encode(focus_text)
    print(f"GPT-2 (BPE) tokens ({len(gpt2_tokens)} tokens from a vocabulary of {GPT2_VOCAB_SIZE}):")
    print(gpt2_tokens)
    print(f"GPT-2 token IDs: {gpt2_token_ids}")

    # Convert back to text
    gpt2_decoded = gpt2_tokenizer.decode(gpt2_token_ids)
    print(f"GPT-2 decoded: {gpt2_decoded}")

    # WordPiece tokenization using BERT's tokenizer
    # (bert-base-uncased lowercases its input itself, so no manual .lower() is needed)
    bert_tokens = bert_tokenizer.tokenize(focus_text)
    bert_token_ids = bert_tokenizer.encode(focus_text)  # encode() adds [CLS] and [SEP]
    print(f"\nBERT (WordPiece) tokens ({len(bert_tokens)} tokens from a vocabulary of {BERT_VOCAB_SIZE}):")
    print(bert_tokens)
    print(f"BERT token IDs: {bert_token_ids}")

    # Convert back to text
    bert_decoded = bert_tokenizer.decode(bert_token_ids)
    print(f"BERT decoded: {bert_decoded}")

    # Note on CLS/SEP tokens
    print("\nNote: BERT token IDs include special [CLS] and [SEP] tokens that surround the input text.")


Example 1: Natural language processing transforms text into numbers.
GPT-2 (BPE) tokens (8 tokens from a vocabulary of 50257):
['Natural', 'Ġlanguage', 'Ġprocessing', 'Ġtransforms', 'Ġtext', 'Ġinto', 'Ġnumbers', '.']
GPT-2 token IDs: [35364, 3303, 7587, 31408, 2420, 656, 3146, 13]
GPT-2 decoded: Natural language processing transforms text into numbers.

BERT (WordPiece) tokens (8 tokens from a vocabulary of 30522):
['natural', 'language', 'processing', 'transforms', 'text', 'into', 'numbers', '.']
BERT token IDs: [101, 3019, 2653, 6364, 21743, 3793, 2046, 3616, 1012, 102]
BERT decoded: [CLS] natural language processing transforms text into numbers. [SEP]

Note: BERT token IDs include special [CLS] and [SEP] tokens that surround the input text.

Example 2: Deep learning models understand context in language.
GPT-2 (BPE) tokens (8 tokens from a vocabulary of 50257):
['Deep', 'Ġlearning', 'Ġmodels', 'Ġunderstand', 'Ġcontext', 'Ġin', 'Ġlanguage', '.']
GPT-2 token IDs: [29744, 4673, 4981, 

### Questions:
Use `BertTokenizer` and `gpt2_tokenizer` to convert the sentence below into tokens and token IDs.

In [16]:
sentence = "we want the numerical representation of this sentence"