# FT Tokenize

FT Tokenize is a C++17 tokenizer library with Python bindings via pybind11, created as a **personal project** to better understand tokenization. It supports both **word-level** and **BPE (Byte Pair Encoding)** tokenization. You can train it on your own text files, save and load models, encode text into token IDs or tokens, and decode either form back into text.
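
To make the difference between the two modes concrete, here is a minimal pure-Python sketch of how BPE learns merges. It is a toy illustration of the general technique, not FT Tokenize's C++ implementation; the function names `train_toy_bpe`, `merge_pair` and `encode_toy_bpe` are hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_toy_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Every word starts as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair (ties go to the pair seen first).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new merge to every word in the corpus.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def encode_toy_bpe(word, merges):
    """Split `word` into characters, then apply the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = train_toy_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)
print(encode_toy_bpe("lowly", merges))
```

A word-level tokenizer would map an unseen word such as `lowly` to a single unknown token, whereas the BPE sketch falls back to smaller pieces it has already learned.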

### Key Methods:

* `train_from_textfile(input_file, vocab_size, user_defined_symbols, mode)` – Train the tokenizer, choosing between **WORD** or **BPE** mode.
* `train_word_level(...)` / `train_bpe(...)` – Train word-level or BPE tokenizers.
* `save_model(model_path)` / `load_model(model_path)` – Save or load the trained model.
* `encode_as_ids(text)` / `encode_as_tokens(text)` – Convert text to token IDs or tokens.
* `decode_ids(ids)` / `decode_tokens(tokens)` – Convert token IDs or tokens back to text.
* `token_to_id(token)` / `id_to_token(id)` – Look up the ID for a token or the token for an ID.
* `get_token_size()` / `get_vocab()` – Get the size of the vocabulary or the list of tokens.


In [1]:
pip install ft-tokenize

Collecting ft-tokenize
  Downloading ft_tokenize-0.1.7-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.4 kB)
Downloading ft_tokenize-0.1.7-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (596 kB)
Installing collected packages: ft-tokenize
Successfully installed ft-tokenize-0.1.7
Note: you may need to restart the kernel to use updated packages.


In [2]:
import ft_tokenize

help(ft_tokenize)

Help on package ft_tokenize:

NAME
    ft_tokenize - #ft_tokenize/__init__.py

PACKAGE CONTENTS
    ft_tokenize

DATA
    BPE = <TokenizerMode.BPE: 1>
    WORD = <TokenizerMode.WORD: 0>

FILE
    /usr/local/lib/python3.11/dist-packages/ft_tokenize/__init__.py




In [3]:
import pandas as pd
import numpy as np
import os

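# Show full article text instead of truncating long cells
pd.set_option('display.max_colwidth', None)
df = pd.read_excel("/kaggle/input/inshorts/Inshorts Cleaned Data.xlsx")
# Keep only the summary column and draw a reproducible 1,000-row sample
df = df[['Short']].sample(n=1000, random_state=42).reset_index(drop=True)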

print(df.shape)

(1000, 1)


In [4]:
df.head(5)

Unnamed: 0,Short
0,"Gangster-turned-politician Mukhtar Ansari has won from the Mau constituency in Uttar Pradesh after polling 96,793 votes, defeating the nearest candidate by over 8,000 votes. Ansari, who was the sitting MLA from the constituency, had allied with the Mayawati-led Bahujan Samaj Party before the elections. Ansari has been accused of murdering a BJP MLA."
1,"Indira Gandhi has been the only woman till date to have presented the Union Budget of India in 1970-71. This came after Indira Gandhi, the then Prime Minister, took over the Finance portfolio after Morarji Desai resigned as the Minister of Finance. So far, she has been the only woman Finance Minister of India."
2,"Actor Adam Sandler brought his 23-year-old doppelgänger Max Kessler for the premiere of his upcoming film &#39;The Do-Over&#39;. Sandler noticed Max after the 23-year-old posted a picture of himself alongside Sandler on Reddit. It was captioned, &#34;The name of Adam Sandler&#39;s character in... &#39;The Do-Over&#39; is Max Kessler. My name is Max Kessler... and I look just like him.&#34;"
3,"The Supreme Court recently reminded the Centre that Aadhaar cannot be made mandatory for any services. The apex court also ordered the Centre to remove its condition of making Aadhaar mandatory in scholarship schemes for students. &#34;The Aadhaar card Scheme is purely voluntary and cannot be made mandatory till the matter is finally decided by this Court,&#34; the SC added."
4,"Researchers at the University of Stuttgart have built wall-climbing mini robots that work together to create architecture from carbon fibre. The robots carry carbon fibre thread spools that they pass back and forth after affixing to points on a wall. The researchers are planning to increase the number of robots, allowing them to attach fibre to ceilings and curved walls."


In [5]:
import re

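def clean_text(t):
    # Empty cells arrive as NaN (a float); treat them as empty strings
    if isinstance(t, float):
        return ""
    # Drop URLs
    t = re.sub(r"http\S+", "", t)
    # Replace everything except letters, digits, spaces and basic punctuation with a space.
    # Note: HTML entities such as &#39; lose their '&', '#' and ';' here, leaving the bare digits (39).
    t = re.sub(r"[^A-Za-z0-9 ,.'’\-?()]", " ", t)
    # Collapse runs of whitespace
    t = re.sub(r"\s+", " ", t).strip()
    return t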


df['Short'] = df['Short'].apply(clean_text)
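
The sample rows contain HTML entities (`&#39;`, `&#34;`), and the vocabulary printed later lists `'39'` and `'34'` among its most frequent tokens, which is consistent with the cleaning step stripping `&`, `#` and `;` but keeping the digits. One possible fix, sketched here under the assumption that decoding entities first is acceptable for this dataset, is to call the standard library's `html.unescape` before filtering. The name `clean_text_unescaped` is hypothetical.

```python
import html
import re


def clean_text_unescaped(t):
    """Variant of clean_text that decodes HTML entities before filtering characters."""
    # Non-string cells (e.g. NaN from empty rows) become empty strings.
    if not isinstance(t, str):
        return ""
    # Turn &#39; into ' and &#34; into " so no stray digits are left behind.
    t = html.unescape(t)
    t = re.sub(r"http\S+", "", t)
    t = re.sub(r"[^A-Za-z0-9 ,.'’\-?()]", " ", t)
    return re.sub(r"\s+", " ", t).strip()


print(clean_text_unescaped("his upcoming film &#39;The Do-Over&#39;."))
```

Retraining on text cleaned this way would change the vocabulary, so the outputs shown in this notebook would no longer match exactly.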

In [6]:
text_file = "text_sample.txt"

with open(text_file, "w", encoding="utf-8") as f:
    for t in df['Short']:
        f.write(t + "\n")

In [7]:
vocab_size = 5000
user_defined_symbols = ["<SOS>", "<EOS>", " "] 


# Word-level tokenizer: reuse a saved model if one exists, otherwise train and save it
word_model_file = "word_model.model"
mode = ft_tokenize.TokenizerMode.WORD

word_tokenizer = ft_tokenize.TokenizerModel()
if os.path.exists(word_model_file):
    word_tokenizer.load_model(word_model_file)
    print("\nWord model loaded")
else:
    word_tokenizer.train_from_textfile(text_file, vocab_size=vocab_size, user_defined_symbols=user_defined_symbols, mode=mode)
    word_tokenizer.save_model(word_model_file)


# BPE tokenizer: same load-or-train pattern as above
BPE_model_file = "bpe_model.model"
mode = ft_tokenize.TokenizerMode.BPE

BPE_tokenizer = ft_tokenize.TokenizerModel()
if os.path.exists(BPE_model_file):
    BPE_tokenizer.load_model(BPE_model_file)
    print("\nBPE model loaded")
else:
    BPE_tokenizer.train_from_textfile(text_file, vocab_size=vocab_size, user_defined_symbols=user_defined_symbols, mode=mode)
    BPE_tokenizer.save_model(BPE_model_file)
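
The word-level and BPE branches above follow the same load-or-train pattern, so it could be factored into a helper. The sketch below is one way to do that; `load_or_train` is a hypothetical name, and it relies only on the `load_model`, `train_from_textfile` and `save_model` calls already used in this notebook.

```python
import os


def load_or_train(tokenizer, model_path, text_file, vocab_size, user_defined_symbols, mode):
    """Load `model_path` into `tokenizer` if it exists; otherwise train and save it.

    Returns True when the model was loaded from disk, False when it was trained.
    """
    if os.path.exists(model_path):
        tokenizer.load_model(model_path)
        return True
    tokenizer.train_from_textfile(
        text_file,
        vocab_size=vocab_size,
        user_defined_symbols=user_defined_symbols,
        mode=mode,
    )
    tokenizer.save_model(model_path)
    return False


# Possible usage with the objects defined in this notebook:
# word_tokenizer = ft_tokenize.TokenizerModel()
# load_or_train(word_tokenizer, "word_model.model", text_file,
#               vocab_size, user_defined_symbols, ft_tokenize.TokenizerMode.WORD)
```

Because a cached model is loaded whenever the file exists, changing `vocab_size` or the training text has no effect until the saved `.model` file is deleted.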

In [8]:
print("\nVocabulary size:", word_tokenizer.get_token_size())
print("First 20 tokens:", word_tokenizer.get_vocab()[:20])


Vocabulary size: 5007
First 20 tokens: ['<pad>', '<unk>', '<sos>', '<eos>', 'the', 'to', 'of', '39', 'in', 'a', 'and', '34', 'on', 'The', 's', 'for', 'has', 'is', 'that', 'by']


In [9]:
sample_text = df['Short'].iloc[0]


ids = word_tokenizer.encode_as_ids(sample_text)
tokens = word_tokenizer.encode_as_tokens(sample_text)
decoded_text = word_tokenizer.decode_ids(ids)

print("\nOriginal text:", sample_text)


print("\n\nWord encoding ")
print("\nEncoded IDs:", ids)
print("\nTokens:", tokens)
print("\nDecoded text:", decoded_text)


ids = BPE_tokenizer.encode_as_ids(sample_text)
tokens = BPE_tokenizer.encode_as_tokens(sample_text)
decoded_text = BPE_tokenizer.decode_ids(ids)

print("\n\nByte-Pair encoding ")
print("\nEncoded IDs:", ids)
print("\nTokens:", tokens)
print("\nDecoded text:", decoded_text)


Original text: Gangster-turned-politician Mukhtar Ansari has won from the Mau constituency in Uttar Pradesh after polling 96,793 votes, defeating the nearest candidate by over 8,000 votes. Ansari, who was the sitting MLA from the constituency, had allied with the Mayawati-led Bahujan Samaj Party before the elections. Ansari has been accused of murdering a BJP MLA.


Word encoding 

Encoded IDs: [1, 1, 4823, 16, 140, 22, 4, 1, 1, 8, 289, 188, 30, 1947, 1, 4551, 1124, 4, 1, 1320, 19, 43, 4908, 4550, 1, 44, 20, 4, 3134, 780, 22, 4, 1, 36, 1, 21, 4, 1, 1, 1, 551, 138, 4, 1182, 4823, 16, 41, 219, 6, 4005, 9, 135, 1]

Tokens: ['<unk>', '<unk>', 'Ansari', 'has', 'won', 'from', 'the', '<unk>', '<unk>', 'in', 'Uttar', 'Pradesh', 'after', 'polling', '<unk>', 'votes,', 'defeating', 'the', '<unk>', 'candidate', 'by', 'over', '8,000', 'votes.', '<unk>', 'who', 'was', 'the', 'sitting', 'MLA', 'from', 'the', '<unk>', 'had', '<unk>', 'with', 'the', '<unk>', '<unk>', '<unk>', 'Party', 'before', 'the',
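
In the word-level output above, every whitespace-delimited word that is not in the vocabulary maps to ID 1 (`<unk>`), including forms that differ only by attached punctuation: `Ansari` has its own ID, while `Ansari,` does not. A quick way to quantify this is to measure the share of unknown IDs. The helper name `unk_rate` is hypothetical, and the ID list is copied from the output above.

```python
def unk_rate(ids, unk_id=1):
    """Return the fraction of positions encoded as the unknown token."""
    if not ids:
        return 0.0
    return sum(1 for i in ids if i == unk_id) / len(ids)


# Word-level IDs copied from the "Encoded IDs" output above.
word_ids = [1, 1, 4823, 16, 140, 22, 4, 1, 1, 8, 289, 188, 30, 1947, 1, 4551,
            1124, 4, 1, 1320, 19, 43, 4908, 4550, 1, 44, 20, 4, 3134, 780, 22,
            4, 1, 36, 1, 21, 4, 1, 1, 1, 551, 138, 4, 1182, 4823, 16, 41, 219,
            6, 4005, 9, 135, 1]

print(f"{unk_rate(word_ids):.1%} of word-level tokens are <unk>")
```

Running the same function on `BPE_tokenizer.encode_as_ids(sample_text)` would give a direct comparison between the two modes.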

In [10]:
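# Note: `tokens` and `ids` were reassigned in the previous cell and now hold the
# BPE tokenizer's output, while the lookups below go through word_tokenizer,
# so the results reflect the word-level vocabulary rather than the BPE one.
sample_token = tokens[0]
token_id = word_tokenizer.token_to_id(sample_token)
print(f"\nToken:'{sample_token}' ID: {token_id}")

sample_id = ids[0]
id_token = word_tokenizer.id_to_token(sample_id)
print(f"ID: {sample_id}  Token: '{id_token}'")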


Token:'Gang' ID: 1
ID: 2867  Token: 'tranche'
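
In the output above, `'Gang'` and `2867` come from the BPE encoding but are looked up in the word-level vocabulary, which would explain why the token maps to ID 1 (`<unk>`) and the ID maps to an unrelated word. To check lookups against the tokenizer that actually produced the encoding, a small helper along these lines could be used. `lookups_agree` is a hypothetical name; it only calls methods listed under Key Methods, and whether FT Tokenize guarantees this round-trip is an assumption to verify rather than a documented property.

```python
def lookups_agree(tokenizer, text):
    """Return True if token_to_id / id_to_token match the tokenizer's own encoding of `text`."""
    ids = tokenizer.encode_as_ids(text)
    tokens = tokenizer.encode_as_tokens(text)
    # The two encodings should describe the same sequence, position by position.
    if len(ids) != len(tokens):
        return False
    return all(
        tokenizer.token_to_id(tok) == idx and tokenizer.id_to_token(idx) == tok
        for tok, idx in zip(tokens, ids)
    )


# Possible usage with the objects defined in this notebook:
# print(lookups_agree(BPE_tokenizer, sample_text))
# print(lookups_agree(word_tokenizer, sample_text))
```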
