# Odia BPE Tokenizer

This notebook demonstrates the basic usage of the OdiaBPETokenizer class.



The tables below summarize the **complete Unicode range for the Odia (ଓଡ଼ିଆ) script**, as defined by the Unicode Consortium.

---

### 📘 **Odia Unicode Block**

| Property              | Value             |
| --------------------- | ----------------- |
| **Block name**        | Odia              |
| **Unicode range**     | `U+0B00 – U+0B7F` |
| **Code points count** | 128 (0x80)        |
| **Script**            | Odia              |
| **Direction**         | Left-to-Right     |

---

### 🔤 **Detailed Character Categories**

| Category                                | Unicode range                                              | Characters                         | Examples / notes            |
| --------------------------------------- | ---------------------------------------------------------- | ---------------------------------- | --------------------------- |
| **Independent vowels**                  | U+0B05–U+0B0C, U+0B0F–U+0B10, U+0B13–U+0B14, U+0B60–U+0B61 | ଅ, ଆ, ଇ, ଈ, ଉ, ଊ, ଋ, ୠ, ଏ, ଐ, ଓ, ଔ | ଅ, ଆ                        |
| **Consonants**                          | U+0B15–U+0B39                                              | କ, ଖ, ଗ, ଘ ... ହ                   | କ, ଗ                        |
| **Dependent vowel signs (matras)**      | U+0B3E–U+0B44, U+0B47–U+0B48, U+0B4B–U+0B4C                | ୀ, ୁ, ୂ, ୃ, ୈ                      | ୀ, ୂ                        |
| **Virama (Halant)**                     | U+0B4D                                                     | ୍                                  | Used for consonant clusters |
| **Anusvara / Visarga / Chandrabindu**   | U+0B01–U+0B03                                              | ଁ, ଂ, ଃ                            | ଂ                           |
| **Digits (Odia numerals)**              | U+0B66–U+0B6F                                              | ୦–୯                                | ୧, ୨                        |
| **Other signs (Isshar, Wa, fractions)** | U+0B70–U+0B77                                              | ୰, ୱ, ୲                            | ୱ                           |
| **Odia sign Nukta**                     | U+0B3C                                                     | ଡ଼ → ଡ + Nukta                      | ଡ଼                           |
| **Odia sign Avagraha**                  | U+0B3D                                                     | ଽ                                  | Used in Sanskrit contexts   |

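As a quick cross-check of the ranges above, Python's standard `unicodedata` module can report the general category and official name of any code point. The sketch below is illustrative only: `is_odia` and `describe` are hypothetical helper names, not part of `OdiaBPETokenizer`. Note that Unicode character names still use the older spelling "ORIYA".

```python
import unicodedata

# Bounds of the Odia block, as listed in the block table above.
ODIA_START, ODIA_END = 0x0B00, 0x0B7F

def is_odia(ch: str) -> bool:
    """Return True if the single character `ch` lies in the Odia block."""
    return ODIA_START <= ord(ch) <= ODIA_END

def describe(ch: str) -> str:
    """Format a code point with its general category and Unicode name."""
    name = unicodedata.name(ch, "<unassigned>")
    return f"U+{ord(ch):04X} {unicodedata.category(ch)} {name}"

# One consonant, one matra, the virama, one digit, and the danda.
for ch in ["\u0B15", "\u0B3E", "\u0B4D", "\u0B67", "\u0964"]:
    print(describe(ch), "| in Odia block:", is_odia(ch))
```

The danda (U+0964) is worth noticing: it belongs to the Devanagari block, so any rule keyed on `\u0B00-\u0B7F` treats it as non-Odia text.
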

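The tokenizer pre-splits text into chunks with the following pattern (imported below as `ODIA_SPLIT_PATTERN`):

`ODIA_SPLIT_PATTERN = r"""[\u0B00-\u0B7F]+|\d+|[^\s\u0B00-\u0B7F\d]+| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""`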
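To see how a pattern like this splits text, here is a simplified stand-in that runs on the standard `re` module. The full pattern uses `\p{L}` and `\p{N}`, which `re` does not support, so it presumably relies on the third-party `regex` package (an assumption; the notebook does not show that import). `SIMPLE_ODIA_PATTERN` and `split_chunks` are hypothetical names, and the sketch keeps only the first three alternatives plus a catch-all for whitespace.

```python
import re

# Simplified stand-in for ODIA_SPLIT_PATTERN (assumption: not the real pattern).
# Alternatives, in order: runs of Odia-block characters, runs of digits,
# runs of any other non-whitespace characters, runs of whitespace.
SIMPLE_ODIA_PATTERN = re.compile(
    r"[\u0B00-\u0B7F]+|\d+|[^\s\u0B00-\u0B7F\d]+|\s+"
)

def split_chunks(text: str) -> list[str]:
    """Split text into chunks; concatenating the chunks restores the input."""
    return SIMPLE_ODIA_PATTERN.findall(text)

sample = "ଓଡ଼ିଆ 2024 BPE"
chunks = split_chunks(sample)
print(chunks)
print("lossless:", "".join(chunks) == sample)
```

Odia text, digits, Latin text, and whitespace each land in their own chunk. If merges are confined to chunks, as in most regex-based BPE tokenizers, a learned token never spans two scripts or crosses a word boundary.
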

## 1. Import and Setup


In [1]:
from odia_bpe_tokenizer import OdiaBPETokenizer, ODIA_SPLIT_PATTERN
from dataset import load_odia_dataset

## 2. Load Dataset and Prepare Training Data


In [2]:
# Load dataset
ds = load_odia_dataset('./data')

# Use a subset for training
num_samples = 500000
training_text = ' '.join(ds['train']['text'][:num_samples])

print(f"Training text size: {len(training_text):,} characters")
print(f"Sample: {training_text[:100]}...")


Loading dataset from cache directory: ./data


Dataset loaded successfully!
Dataset structure: DatasetDict({
    train: Dataset({
        features: ['text', 'source'],
        num_rows: 5941930
    })
})
Training text size: 49,299,688 characters
Sample: ଆଇସିସି ବିଶ୍ବକପରେ ସେ ଅଂଶ ଗ୍ରହଣ କରିଥିବାରୁ ସେ ନିଜକୁ ଭାଗ୍ୟଶାଳୀ ମନେ କରୁଛନ୍ତି । କରୀନା କପୂର ହେଉଛନ୍ତି ଭାରତୀୟ...


## 3. Train Tokenizer


In [3]:
# Create and train tokenizer
tokenizer = OdiaBPETokenizer(pattern=ODIA_SPLIT_PATTERN)

print("Training tokenizer...")
stats = tokenizer.train(training_text, vocab_size=10000, verbose=True)


Training tokenizer...
Text split into 14827345 chunks
merge 1/9744: (224, 172) -> 256 (ÔøΩ) had 31586333 occurrences
merge 2/9744: (224, 173) -> 257 (ÔøΩ) had 9102226 occurrences
merge 3/9744: (256, 190) -> 258 (‡¨æ) had 3680563 occurrences
merge 4/9744: (257, 141) -> 259 (‡≠ç) had 3621787 occurrences
merge 5/9744: (256, 176) -> 260 (‡¨∞) had 3412710 occurrences
merge 6/9744: (256, 191) -> 261 (‡¨ø) had 3381404 occurrences
merge 7/9744: (259, 256) -> 262 (‡≠çÔøΩ) had 2288435 occurrences
merge 8/9744: (261, 256) -> 263 (‡¨øÔøΩ) had 2233023 occurrences
merge 9/9744: (258, 256) -> 264 (‡¨æÔøΩ) had 2157223 occurrences
merge 10/9744: (257, 135) -> 265 (‡≠á) had 1808806 occurrences
merge 11/9744: (256, 149) -> 266 (‡¨ï) had 1349971 occurrences
merge 12/9744: (257, 129) -> 267 (‡≠Å) had 1148230 occurrences
merge 13/9744: (256, 168) -> 268 (‡¨®) had 1094714 occurrences
merge 14/9744: (256, 184) -> 269 (‡¨∏) had 1040635 occurrences
merge 15/9744: (256, 172) -> 270 (‡¨¨) had 1031023 occurrences
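
The log above is easier to read with one BPE step worked out by hand. This is a minimal sketch of byte-pair counting and merging, not the `OdiaBPETokenizer.train` implementation; `pair_counts` and `merge` are hypothetical helpers, and the chunking step is skipped. Every code point in U+0B00–U+0B3F encodes to UTF-8 as `E0 AC xx` (224, 172, ...) and every code point in U+0B40–U+0B7F as `E0 AD xx` (224, 173, ...), which is why those two pairs win the first two merges. A merged pair that is only the prefix of a character is not valid UTF-8 on its own, so it decodes to the replacement character U+FFFD.

```python
from collections import Counter

def pair_counts(ids):
    """Count occurrences of each adjacent pair of token ids."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# "Odia" in Odia script, spelled with escapes so the bytes are unambiguous.
text = "\u0B13\u0B21\u0B3C\u0B3F\u0B06"
ids = list(text.encode("utf-8"))      # 5 characters -> 15 bytes

top_pair, count = pair_counts(ids).most_common(1)[0]
merged = merge(ids, top_pair, 256)    # 256 is the first id after the raw bytes

print(top_pair, count)                # (224, 172) 5
print(len(ids), "->", len(merged))    # 15 -> 10
# A two-byte prefix of a three-byte character is not decodable by itself.
print(bytes(top_pair).decode("utf-8", errors="replace"))
```

With `vocab_size=10000` and 256 base byte tokens, training performs 10000 - 256 = 9744 such merges, which matches the `1/9744` counter in the log.
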


## 4. Test Encoding and Decoding


In [4]:
# Test on example texts
test_text = "କାର୍ତ୍ତିକ ପୂର୍ଣ୍ଣିମା ଉପଲକ୍ଷେ ସମସ୍ତଙ୍କୁ ହାର୍ଦ୍ଦିକ ଶୁଭେଚ୍ଛା ଆଉ ଶୁଭକାମନା। ବନ୍ଦେ ଉତ୍କଳ ଜନନୀ"

# Encode
encoded = tokenizer.encode(test_text)
print(f"Original: {test_text}")
print(f"Encoded:  {encoded}")
print(f"Tokens:   {len(encoded)}")

# Decode
decoded = tokenizer.decode(encoded)
print(f"Decoded:  {decoded}")
print(f"Match:    {test_text == decoded}")

# Compression stats
comp_stats = tokenizer.get_compression_stats(test_text)
print(f"\nCompression ratio: {comp_stats['compression_ratio']:.2f}x")
print(f"Bytes per token:   {comp_stats['bytes_per_token']:.2f}")


Original: କାର୍ତ୍ତିକ ପୂର୍ଣ୍ଣିମା ଉପଲକ୍ଷେ ସମସ୍ତଙ୍କୁ ହାର୍ଦ୍ଦିକ ଶୁଭେଚ୍ଛା ଆଉ ଶୁଭକାମନା। ବନ୍ଦେ ଉତ୍କଳ ଜନନୀ
Encoded:  [6517, 32, 9574, 32, 8115, 32, 3564, 32, 8816, 32, 5799, 32, 928, 32, 1963, 6851, 224, 165, 164, 32, 951, 265, 32, 4590, 32, 706, 681]
Tokens:   27
Decoded:  କାର୍ତ୍ତିକ ପୂର୍ଣ୍ଣିମା ଉପଲକ୍ଷେ ସମସ୍ତଙ୍କୁ ହାର୍ଦ୍ଦିକ ଶୁଭେଚ୍ଛା ଆଉ ଶୁଭକାମନା। ବନ୍ଦେ ଉତ୍କଳ ଜନନୀ
Match:    True
Match:    True

Compression ratio: 8.93x
Bytes per token:   8.93
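
The two figures above are identical, which suggests that both are computed as UTF-8 bytes divided by token count, with one byte per token as the baseline. That is an assumption, since the source of `get_compression_stats` is not shown, but the arithmetic can be checked from the printed output alone:

```python
# Test sentence and token ids copied from the cell output above.
test_text = "କାର୍ତ୍ତିକ ପୂର୍ଣ୍ଣିମା ଉପଲକ୍ଷେ ସମସ୍ତଙ୍କୁ ହାର୍ଦ୍ଦିକ ଶୁଭେଚ୍ଛା ଆଉ ଶୁଭକାମନା। ବନ୍ଦେ ଉତ୍କଳ ଜନନୀ"
encoded = [6517, 32, 9574, 32, 8115, 32, 3564, 32, 8816, 32, 5799, 32,
           928, 32, 1963, 6851, 224, 165, 164, 32, 951, 265, 32, 4590,
           32, 706, 681]

n_bytes = len(test_text.encode("utf-8"))   # size of the raw byte sequence
n_tokens = len(encoded)                    # size after BPE encoding
bytes_per_token = n_bytes / n_tokens

print(f"{n_bytes} bytes / {n_tokens} tokens = {bytes_per_token:.2f}")
```

Each Odia character takes three bytes in UTF-8, so 8.93 bytes per token corresponds to roughly three characters per token. The danda `।` (U+0964) stays as three raw byte tokens (224, 165, 164), and every space is its own token (32).
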


## 5. Save and Load Tokenizer


In [5]:
# Save tokenizer
tokenizer.save("odia_bpe_tokenizer")

# Load it back
loaded_tokenizer = OdiaBPETokenizer()
loaded_tokenizer.load("odia_bpe_tokenizer.model")

# Test that it works
test = "ଓଡ଼ିଆ"
original_encoded = tokenizer.encode(test)
loaded_encoded = loaded_tokenizer.encode(test)

print(f"Original: {original_encoded}")
print(f"Loaded:   {loaded_encoded}")
print(f"Match:    {original_encoded == loaded_encoded}")


Tokenizer saved to odia_bpe_tokenizer.model and odia_bpe_tokenizer.vocab
Tokenizer loaded from odia_bpe_tokenizer.model
Vocabulary size: 10000
Number of merges: 9744
Original: [2060]
Loaded:   [2060]
Match:    True
