# ONNX Tokenizer Interoperability Example

This Jupyter notebook shows how to convert the tokenizer of an embedding model to ONNX format, using `BAAI/bge-m3` as the example. The `bge-m3` tokenizer is the fast XLM-RoBERTa tokenizer; once exported to ONNX, it provides **interoperable** text processing in any language supported by ONNX Runtime Extensions, such as Python, C#, C++, Java, or JavaScript. The critical step in this pipeline is converting the ONNX tokenizer outputs (`tokens`, `instance_indices`, `token_indices`) into the inputs the transformer model expects (`input_ids`, `attention_mask`).

## Install Required Packages

First, let's install all the necessary packages for this notebook.

In [None]:
!pip install onnxruntime onnxruntime-extensions numpy transformers

## Imports

Import ONNX Runtime, NumPy, Hugging Face Transformers, and `os` for file operations. Together these cover tokenizer conversion and embedding generation.

In [14]:
import onnxruntime as ort
import numpy as np
from onnxruntime_extensions import gen_processing_models, get_library_path
from transformers import AutoTokenizer
import os

## Initialize Tokenizer

Initialize a Hugging Face tokenizer (using `BAAI/bge-m3` as an example) and convert it to ONNX format. The ONNX tokenizer can be deployed in any language that supports ONNX Extensions.

In [15]:
def initialize_tokenizer(model_type="BAAI/bge-m3", tokenizer_path="tokenizer.onnx"):
    """Initialize and export the tokenizer if needed.

    Args:
        model_type (str): Hugging Face model name (e.g., 'BAAI/bge-m3').
        tokenizer_path (str): Path to save/load the ONNX tokenizer.

    Returns:
        tuple: Hugging Face tokenizer and ONNX tokenizer session.
    """
    hf_tokenizer = AutoTokenizer.from_pretrained(model_type)
    
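    if not os.path.exists(tokenizer_path):
        print(f"Generating ONNX tokenizer at {tokenizer_path}")
        # The first element returned is the pre-processing (tokenizer) model.
        tokenizer_model = gen_processing_models(hf_tokenizer, pre_kwargs={}, post_kwargs={})[0]
        # Create the target folder first if the path includes one.
        out_dir = os.path.dirname(tokenizer_path)
        if out_dir:
            os.makedirs(out_dir, exist_ok=True)
        with open(tokenizer_path, "wb") as f:
            f.write(tokenizer_model.SerializeToString())
    
    # The tokenizer graph uses custom ops, so register the extensions library.
    sess_options = ort.SessionOptions()
    sess_options.register_custom_ops_library(get_library_path())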
    
    tokenizer_session = ort.InferenceSession(
        tokenizer_path, 
        sess_options=sess_options, 
        providers=['CPUExecutionProvider']
    )
    
    return hf_tokenizer, tokenizer_session

## Convert Tokenizer Outputs

Convert the ONNX tokenizer outputs (`tokens`, `token_indices`) into the model inputs (`input_ids`, `attention_mask`). Token IDs are ordered by their position index and wrapped in a batch dimension of size 1. The `attention_mask` is all 1s because this pipeline processes a single text without padding or truncation, so every token is valid and should be attended to by the model. The third tokenizer output, `instance_indices`, is not needed for a single text and is ignored.

In [16]:
def convert_tokenizer_outputs(tokens, token_indices):
    """Convert tokenizer outputs to the format expected by the model.

    Args:
        tokens: Token IDs from ONNX tokenizer.
        token_indices: Token position indices.

    Returns:
        tuple: input_ids and attention_mask as NumPy arrays.
    """
    # Pair each token ID with its position, then order the pairs by position.
    # zip() stops at the shorter array, which matches the original bounds check.
    token_pairs = sorted(zip(token_indices, tokens))
    ordered_tokens = [token for _, token in token_pairs]
    
    input_ids = np.array([ordered_tokens], dtype=np.int64)
    attention_mask = np.ones_like(input_ids, dtype=np.int64)
    
    return input_ids, attention_mask
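
The single-text conversion above can in principle be extended to batches, where padding makes the `attention_mask` non-trivial. The following is only a sketch and is not part of the original pipeline. It assumes that `instance_indices` gives, for each token, the row (text) it belongs to, and that `token_indices` gives its position within that row. It also assumes a padding ID of 1, which is the `<pad>` ID in XLM-RoBERTa vocabularies (confirm with `hf_tokenizer.pad_token_id`). The helper name `convert_batch_outputs` is hypothetical.

```python
import numpy as np


def convert_batch_outputs(tokens, instance_indices, token_indices, pad_id=1):
    """Build padded input_ids and attention_mask for a batch (illustrative sketch).

    Assumes instance_indices[i] is the row of tokens[i] and token_indices[i]
    is its position within that row. pad_id=1 is an assumption.
    """
    tokens = np.asarray(tokens, dtype=np.int64)
    rows = np.asarray(instance_indices, dtype=np.int64)
    cols = np.asarray(token_indices, dtype=np.int64)

    batch_size = int(rows.max()) + 1
    max_len = int(cols.max()) + 1

    # Start from an all-padding batch with nothing attended to.
    input_ids = np.full((batch_size, max_len), pad_id, dtype=np.int64)
    attention_mask = np.zeros((batch_size, max_len), dtype=np.int64)

    # Scatter each token into its (row, position) slot and mark it as valid.
    input_ids[rows, cols] = tokens
    attention_mask[rows, cols] = 1
    return input_ids, attention_mask


# Toy batch: row 0 has four tokens, row 1 has two.
ids, mask = convert_batch_outputs(
    tokens=[0, 10, 11, 2, 0, 2],
    instance_indices=[0, 0, 0, 0, 1, 1],
    token_indices=[0, 1, 2, 3, 0, 1],
)
print(ids)
print(mask)
```

Padded positions keep `pad_id` in `input_ids` and 0 in `attention_mask`, so the model ignores them. For a single text the result reduces to the all-1s mask used above.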

## Generate Embedding

Generate an embedding for a single text using the ONNX tokenizer and model.

In [17]:
def generate_embedding(text, tokenizer_session, model_session):
    """Generate embedding for a single text.

    Args:
        text (str): Input text to generate embedding for.
        tokenizer_session: ONNX tokenizer session.
        model_session: ONNX model session.

    Returns:
        numpy.ndarray: The sentence embedding.
    """
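    # The tokenizer graph takes a string tensor named "inputs" and returns
    # (tokens, instance_indices, token_indices) in that order.
    tokenizer_outputs = tokenizer_session.run(None, {"inputs": np.array([text])})
    tokens, _, token_indices = tokenizer_outputs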
    
    input_ids, attention_mask = convert_tokenizer_outputs(
        tokens, token_indices
    )
    
    outputs = model_session.run(None, {
        "input_ids": input_ids,
        "attention_mask": attention_mask
    })
    
    # The second model output is returned as the sentence embedding.
    return outputs[1]
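
`generate_embedding` hard-codes the tokenizer input name (`"inputs"`), the order of the three tokenizer outputs, and the index of the model output. A minimal sketch for checking those assumptions against the loaded sessions is shown below. `describe_io` is a hypothetical helper; it relies only on the `get_inputs()` and `get_outputs()` methods of `ort.InferenceSession` and the `.name` attribute of each entry.

```python
def describe_io(session):
    """Return the input and output names of an ONNX Runtime session.

    `session` is expected to be an onnxruntime.InferenceSession; only its
    get_inputs() / get_outputs() methods and each entry's .name are used.
    """
    return {
        "inputs": [node.name for node in session.get_inputs()],
        "outputs": [node.name for node in session.get_outputs()],
    }
```

Printing `describe_io(tokenizer_session)` and `describe_io(model_session)` before running the pipeline makes it easy to spot a different export that uses other names or a different output order.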

## Compare with Hugging Face Tokenizer

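Compare the embedding produced from the ONNX tokenizer's output with the embedding produced from the Hugging Face tokenizer's output, running both through the same ONNX model (`bge-m3` in this example). A cosine similarity of 1.0, up to floating-point error, indicates that the two tokenizers produce equivalent model inputs.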

In [18]:
def compare_with_hf_tokenizer(text, hf_tokenizer, tokenizer_session, model_session):
    """Compare embeddings from ONNX and Hugging Face tokenizers.

    Args:
        text (str): Input text to tokenize and embed.
        hf_tokenizer: Hugging Face tokenizer.
        tokenizer_session: ONNX tokenizer session.
        model_session: ONNX model session.

    Returns:
        float: Cosine similarity between embeddings.
    """
    # ONNX tokenizer embedding
    onnx_embedding = generate_embedding(text, tokenizer_session, model_session)
    
    # Hugging Face tokenizer embedding
    hf_inputs = hf_tokenizer(text, return_tensors="np")
    hf_outputs = model_session.run(None, {
        "input_ids": hf_inputs["input_ids"],
        "attention_mask": hf_inputs["attention_mask"]
    })
    hf_embedding = hf_outputs[1]
    
    # Calculate cosine similarity
    cosine_sim = np.dot(onnx_embedding.flatten(), hf_embedding.flatten()) / (
        np.linalg.norm(onnx_embedding) * np.linalg.norm(hf_embedding)
    )
    
    return cosine_sim
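
Cosine similarity between embeddings is an indirect check: it confirms the model saw equivalent inputs. A more direct check is to compare the token IDs themselves. The sketch below shows both on toy arrays that stand in for real embeddings and IDs; `cosine_similarity` and `same_token_ids` are hypothetical helper names and are not part of the notebook.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity of two arrays, each flattened to a 1-D vector."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def same_token_ids(onnx_input_ids, hf_input_ids):
    """True when both arrays have the same shape and the same values."""
    return bool(np.array_equal(np.asarray(onnx_input_ids), np.asarray(hf_input_ids)))


# Toy data standing in for real embeddings and token IDs.
v = np.array([[1.0, 2.0, 2.0]])
print(cosine_similarity(v, 3 * v))                  # parallel vectors
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # orthogonal vectors
print(same_token_ids([[0, 10, 2]], [[0, 10, 2]]))
```

In the real pipeline, the arguments to `same_token_ids` would be the `input_ids` returned by `convert_tokenizer_outputs` and `hf_inputs["input_ids"]`. Because embeddings are floating-point values, comparing the similarity with a tolerance (for example `np.isclose(similarity, 1.0)`) is safer than testing for exact equality.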

## Main Execution

Test the ONNX tokenizer pipeline using `bge-m3` as an example. Generate an embedding for a sample text and compare it with the Hugging Face tokenizer's output. To run this pipeline, download `model.onnx` and `model.onnx_data` from https://huggingface.co/BAAI/bge-m3/tree/main/onnx and place them in an `onnx` folder. Note that the `onnx` folder is included in `.gitignore` by default.
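
Before running the cell below, it can help to confirm that the downloaded files are where the code expects them. This is a small pre-flight sketch using only the standard library; `missing_model_files` is a hypothetical helper, and the file names are the two listed above.

```python
from pathlib import Path


def missing_model_files(folder="onnx", names=("model.onnx", "model.onnx_data")):
    """Return the expected model files that are not present in `folder`."""
    base = Path(folder)
    return [name for name in names if not (base / name).is_file()]


missing = missing_model_files()
if missing:
    print(f"Missing files in 'onnx/': {missing}")
```

An empty list means both files were found and the model session can be created.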

In [19]:
# Initialize tokenizer and model
hf_tokenizer, tokenizer_session = initialize_tokenizer(tokenizer_path="onnx/tokenizer.onnx")
model_session = ort.InferenceSession("onnx/model.onnx", providers=['CPUExecutionProvider'])

# Test with a sample text
sample_text = "A test text! Texto de prueba! Текст для теста! 測試文字! Testtext! Testez le texte! Сынақ мәтіні! Тестни текст! परीक्षण पाठ! Kiểm tra văn bản!"
embedding = generate_embedding(sample_text, tokenizer_session, model_session)

print(f"Generated embedding shape: {embedding.shape}")
print(f"Sample values: {embedding.flatten()[:5]}")

# Compare with Hugging Face tokenizer
similarity = compare_with_hf_tokenizer(sample_text, hf_tokenizer, tokenizer_session, model_session)
print(f"Embedding cosine similarity: {similarity}")

Generated embedding shape: (1, 1024)
Sample values: [-0.00892851  0.02104793 -0.01595523 -0.03338689  0.00300002]
Embedding cosine similarity: 1.0
