
# Subword Tokenization

This notebook gives an overview of subword tokenization in natural language processing (NLP). It covers the pre-trained GPT-2 tokenizer, Byte Pair Encoding (BPE), and SentencePiece. The key components are:

1. **GPT-2 Tokenization**:
   - Uses the pre-trained GPT-2 tokenizer to demonstrate how a sentence is tokenized and converted into token IDs.
   - Example sentence: Subword tokenization helps with rare and unseen words.

2. **Byte Pair Encoding (BPE)**:
   - A BPE tokenizer is created using the "tokenizers" library.
   - The corpus for training is a sample text related to subword tokenization.
   - The tokenizer is trained and then used to tokenize a sentence taken from the corpus, illustrating how BPE works.
   - A post-processing step adds the special tokens [CLS] and [SEP] used by models such as BERT.
   
3. **SentencePiece**:
   - The SentencePiece tokenizer is trained on the same corpus with a small vocabulary size (50).
   - The trained model is used to tokenize and decode a sentence, showcasing how subword tokenization works in SentencePiece.
   
The overall goal of the notebook is to show how subword tokenization helps break down rare or unseen words into smaller units, improving the handling of complex words in NLP tasks.

Tech Stack:
1. Python 3
2. Transformers Library
3. Tokenizers Library
4. SentencePiece

In [1]:
from transformers import GPT2Tokenizer

# Load the pre-trained GPT-2 tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Example sentence
sentence = "Subword tokenization helps with rare and unseen words."

# Tokenize the sentence
tokenized_sentence = tokenizer.tokenize(sentence)

# Convert tokens to token IDs
token_ids = tokenizer.convert_tokens_to_ids(tokenized_sentence)

# Print the results
print("Original Sentence:", sentence)
print("Tokenized Sentence:", tokenized_sentence)
print("Token IDs:", token_ids)

Original Sentence: Subword tokenization helps with rare and unseen words.
Tokenized Sentence: ['Sub', 'word', 'Ġtoken', 'ization', 'Ġhelps', 'Ġwith', 'Ġrare', 'Ġand', 'Ġunseen', 'Ġwords', '.']
Token IDs: [7004, 4775, 11241, 1634, 5419, 351, 4071, 290, 29587, 2456, 13]
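
The `Ġ` at the start of tokens such as `Ġtoken` marks a leading space. GPT-2 uses byte-level BPE, in which every byte is shown as a printable character, and the space byte ends up displayed as `Ġ`. The sketch below rebuilds that byte-to-character table in plain Python, following the scheme used in the GPT-2 reference code as I understand it, and uses it to turn the tokens above back into text. The helper names `bytes_to_unicode` and `detokenize` are chosen for illustration and are not part of the notebook.

```python
def bytes_to_unicode():
    """Map every byte value (0-255) to a printable Unicode character."""
    # Bytes that already print cleanly keep their own character:
    # '!'..'~', then U+00A1..U+00AC, then U+00AE..U+00FF.
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    mapping = {b: chr(b) for b in keep}
    shift = 0
    for b in range(256):
        if b not in mapping:
            # Remaining bytes (control characters, space, ...) are moved above U+00FF.
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping


byte_to_char = bytes_to_unicode()
char_to_byte = {c: b for b, c in byte_to_char.items()}


def detokenize(tokens):
    """Join byte-level tokens and map their characters back to UTF-8 text."""
    raw = bytes(char_to_byte[ch] for ch in "".join(tokens))
    return raw.decode("utf-8", errors="replace")


# Tokens copied from the output above
tokens = ['Sub', 'word', 'Ġtoken', 'ization', 'Ġhelps', 'Ġwith',
          'Ġrare', 'Ġand', 'Ġunseen', 'Ġwords', '.']

print(repr(byte_to_char[ord(" ")]))  # the character that stands in for a space
print(detokenize(tokens))
```

In the notebook itself, `tokenizer.convert_tokens_to_string(tokenized_sentence)` or `tokenizer.decode(token_ids)` does the same job.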


In [2]:
# torchtext is not used anywhere in this notebook; the cells below need tokenizers and sentencepiece
!pip install tokenizers sentencepiece


In [3]:
# Sample text for the corpus
sample_text = """
Subword tokenization is a powerful technique in natural language processing.
It helps break down rare and unseen words into smaller units.
This technique is used in many state-of-the-art models, including BERT and GPT.
Subword tokenization is particularly effective for languages with complex morphology.
"""

# Write the sample text to a file
with open('corpus.txt', 'w', encoding='utf-8') as f:
    f.write(sample_text)

print("Corpus file 'corpus.txt' created successfully.")

Corpus file 'corpus.txt' created successfully.


In [4]:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import BertProcessing
from tokenizers.normalizers import NFKC

# Initialize a BPE tokenizer; characters not seen in training map to [UNK]
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

# Normalize text with NFKC, then split it on whitespace and punctuation
tokenizer.normalizer = NFKC()
tokenizer.pre_tokenizer = Whitespace()

# Train the tokenizer on the corpus file written in the previous cell.
# No vocab_size is passed, so the trainer uses its default.
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
corpus = ["corpus.txt"]
tokenizer.train(corpus, trainer)

# Wrap every encoded sequence as [CLS] ... [SEP].
# BertProcessing takes the (token, id) pairs in the order (sep, cls).
tokenizer.post_processor = BertProcessing(
    ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ("[CLS]", tokenizer.token_to_id("[CLS]")),
)

# No ByteLevel decoder is set here: the Whitespace pre-tokenizer does not produce
# byte-level tokens, so a byte-level decoder would not restore the spaces between words.
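
To make the post-processing step concrete, here is a minimal pure-Python sketch of a BERT-style template: a single sequence becomes `[CLS] A [SEP]`, and a pair becomes `[CLS] A [SEP] B [SEP]` with segment (type) IDs 0 for the first part and 1 for the second. The function `add_special_tokens` is a hypothetical helper written for this illustration; it is not how the `tokenizers` library implements the step.

```python
def add_special_tokens(tokens_a, tokens_b=None, cls="[CLS]", sep="[SEP]"):
    """Wrap one or two token lists in a BERT-style template.

    Returns the combined tokens and a parallel list of segment (type) IDs.
    """
    tokens = [cls] + list(tokens_a) + [sep]
    type_ids = [0] * len(tokens)
    if tokens_b is not None:
        # The second sequence and its closing [SEP] belong to segment 1.
        tokens += list(tokens_b) + [sep]
        type_ids += [1] * (len(tokens_b) + 1)
    return tokens, type_ids


single_tokens, single_types = add_special_tokens(["Subword", "tokenization"])
pair_tokens, pair_types = add_special_tokens(["rare", "words"], ["unseen", "words", "."])

print(single_tokens, single_types)
print(pair_tokens, pair_types)
```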

In [5]:
# Example text (this sentence also appears in the training corpus)
text = "Subword tokenization is a powerful technique in natural language processing."

# Tokenize the text
encoded = tokenizer.encode(text)

# Get the tokenized text
tokens = encoded.tokens

# Print the tokenized text
print(tokens)

['[CLS]', 'Subword', 'tokenization', 'is', 'a', 'powerful', 'technique', 'in', 'natural', 'language', 'processing', '.', '[SEP]']
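
Every word in the output above comes back as a single token. A likely reason is that the example sentence is taken from the training corpus and `BpeTrainer` was given no `vocab_size`, so on a four-sentence corpus merging can continue until whole words are in the vocabulary. The toy sketch below shows the same effect in plain Python: BPE repeatedly merges the most frequent adjacent pair of symbols, training words eventually collapse into single tokens, and an unseen word is split into known pieces. It is a simplified illustration, not the `tokenizers` implementation, and the names `merge_pair`, `learn_bpe`, and `encode` are made up for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of training words."""
    # Start from single characters; the Counter stores how often each word occurs.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair; ties are broken by comparing the pairs, for determinism.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Split a word into characters, then apply the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


training_words = ["low", "lower", "lowest", "slow"]
merges = learn_bpe(training_words, num_merges=50)

print(encode("lower", merges))    # a training word
print(encode("slowest", merges))  # a word that was not in the training list
```

Passing a smaller `vocab_size` to `BpeTrainer`, or encoding a sentence with words that are not in `corpus.txt`, should make the trained tokenizer above produce visible subword splits as well.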


In [6]:
import sentencepiece as spm

# Train a SentencePiece model with a small vocabulary (50 pieces)
spm.SentencePieceTrainer.train(input='corpus.txt', model_prefix='m', vocab_size=50)

# Load the trained SentencePiece model
sp = spm.SentencePieceProcessor(model_file='m.model')

# Example sentence to tokenize
sentence = "Subword tokenization is very effective and efficient."

# Tokenize the sentence into subwords
tokenized_sentence = sp.encode(sentence, out_type=str)

# Encode the same sentence directly as token IDs
token_ids = sp.encode(sentence, out_type=int)

# Decode the tokenized sentence back to a string
decoded_sentence = sp.decode(token_ids)

# Print results
print("Original Sentence:", sentence)
print("Tokenized Sentence (Subwords):", tokenized_sentence)
print("Token IDs:", token_ids)
print("Decoded Sentence:", decoded_sentence)

Original Sentence: Subword tokenization is very effective and efficient.
Tokenized Sentence (Subwords): ['▁', 'S', 'u', 'b', 'word', '▁', 'to', 'k', 'e', 'ni', 'z', 'at', 'i', 'o', 'n', '▁i', 's', '▁', 'v', 'e', 'r', 'y', '▁', 'e', 'f', 'f', 'ec', 't', 'i', 'v', 'e', '▁', 'an', 'd', '▁', 'e', 'f', 'f', 'i', 'c', 'i', 'e', 'n', 't', '.']
Token IDs: [3, 34, 8, 25, 26, 3, 29, 28, 4, 20, 37, 19, 11, 7, 12, 32, 5, 3, 45, 4, 15, 27, 3, 4, 14, 14, 30, 10, 11, 45, 4, 3, 33, 13, 3, 4, 14, 14, 11, 16, 11, 4, 12, 10, 17]
Decoded Sentence: Subword tokenization is very effective and efficient.
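
Two things stand out in this output. SentencePiece marks word boundaries with `▁`, so the original text can be rebuilt from the pieces alone. And with only 50 vocabulary entries, most pieces are single characters. The short sketch below checks both points in plain Python, using the piece list copied from the output above; it does not call the `sentencepiece` library.

```python
# Pieces copied from the SentencePiece output above
pieces = ['▁', 'S', 'u', 'b', 'word', '▁', 'to', 'k', 'e', 'ni', 'z', 'at',
          'i', 'o', 'n', '▁i', 's', '▁', 'v', 'e', 'r', 'y', '▁', 'e', 'f',
          'f', 'ec', 't', 'i', 'v', 'e', '▁', 'an', 'd', '▁', 'e', 'f', 'f',
          'i', 'c', 'i', 'e', 'n', 't', '.']

# Rebuild the text: concatenate the pieces, turn the marker into spaces,
# and drop the space that the marker adds before the first word.
rebuilt = "".join(pieces).replace("▁", " ").lstrip(" ")

# Measure how coarse the segmentation is.
single_char_pieces = sum(1 for p in pieces if len(p) == 1)
multi_char_pieces = [p for p in pieces if len(p) > 1]

print(rebuilt)
print(f"{single_char_pieces} of {len(pieces)} pieces are single characters")
print("Multi-character pieces:", multi_char_pieces)
```

Training on a larger corpus with a larger `vocab_size` would be expected to yield longer pieces and shorter token sequences.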
