# BGE-M3 ONNX Conversion Research Results

This notebook documents our research results for converting the BGE-M3 model to ONNX format, with three key achievements:

1. Conversion of BGE-M3 model (dense, sparse, and ColBERT outputs) from FlagEmbedding to ONNX
2. Conversion of BGE-M3 tokenizer from Hugging Face to ONNX using ONNX Extensions
3. Verification that the ONNX outputs match the original FlagEmbedding implementation to within floating-point precision

These conversions allow BGE-M3 to be used from any language with ONNX Runtime bindings (C#, Java, etc.), with results that match the FlagEmbedding implementation to within floating-point precision.

## Import Required Libraries

In [None]:
%pip install onnx onnxruntime onnxruntime-extensions numpy torch transformers FlagEmbedding

In [3]:
import onnx
import onnxruntime as ort
import numpy as np
import torch
import torch.nn as nn
from onnxruntime_extensions import gen_processing_models, get_library_path
from transformers import AutoTokenizer
from FlagEmbedding.inference.embedder.encoder_only.m3 import M3Embedder
import os



## 1. Converting BGE-M3 Model to ONNX

First, we convert the BGE-M3 model from FlagEmbedding to ONNX with all three outputs (dense, sparse, ColBERT).

In [4]:
class BGE_M3_ONNX_Wrapper(nn.Module):
    """Wrapper class to make BGE-M3 compatible with ONNX export"""
    def __init__(self, m3_embedder):
        super().__init__()
        self.m3_model = m3_embedder.model
        
    def forward(self, input_ids, attention_mask):
        # Call the M3 model with all output types
        outputs = self.m3_model({
            'input_ids': input_ids, 
            'attention_mask': attention_mask
        }, 
        return_dense=True,
        return_sparse=True, 
        return_colbert_vecs=True)
        
        return (
            outputs['dense_vecs'],      # Dense embeddings
            outputs['sparse_vecs'],     # Sparse weights  
            outputs['colbert_vecs']     # ColBERT vectors
        )

In [7]:
def export_bge_m3_standard_format(model_name_or_path="BAAI/bge-m3", output_path="bge_m3_model.onnx"):
    """Export BGE-M3 model in standard ONNX format: model.onnx + model.onnx_data"""
    print(f"Loading BGE-M3 model from {model_name_or_path}")
    
    # Load the model
    embedder = M3Embedder(
        model_name_or_path=model_name_or_path,
        use_fp16=False,  # Use FP32 for ONNX export
        normalize_embeddings=True  # Match the default behavior
    )
    
    # Wrap the model
    onnx_model = BGE_M3_ONNX_Wrapper(embedder)
    onnx_model.eval()
    
    # Create dummy input
    dummy_input_ids = torch.randint(0, 1000, (1, 512), dtype=torch.long)
    dummy_attention_mask = torch.ones(1, 512, dtype=torch.long)
    
    print("Exporting model to ONNX...")
    
    # Export to ONNX
    torch.onnx.export(
        onnx_model,
        (dummy_input_ids, dummy_attention_mask),
        output_path,
        input_names=['input_ids', 'attention_mask'],
        output_names=['dense_embeddings', 'sparse_weights', 'colbert_vectors'],
        dynamic_axes={
            'input_ids': {0: 'batch_size', 1: 'sequence_length'},
            'attention_mask': {0: 'batch_size', 1: 'sequence_length'},
            'dense_embeddings': {0: 'batch_size'},
            'sparse_weights': {0: 'batch_size', 1: 'sequence_length'},
            'colbert_vectors': {0: 'batch_size', 1: 'sequence_length'}
        },
        opset_version=16,
        export_params=True,
        do_constant_folding=True
    )
    
    print(f"Model exported to: {output_path}")
    
    # Load the exported model and save with external data format
    print("Converting to external data format...")
    
    model = onnx.load(output_path)
    
    # Get directory and filename parts
    output_dir = os.path.dirname(output_path)
    base_filename = os.path.basename(output_path)
    data_filename = base_filename.replace('.onnx', '.onnx_data')
    data_path = os.path.join(output_dir, data_filename)
    
    onnx.save_model(
        model, 
        output_path,
        save_as_external_data=True,
        all_tensors_to_one_file=True,
        location=data_filename,  # Just the filename, not full path
        size_threshold=1024  # Save tensors larger than 1KB externally
    )
    
    print(f"✅ Standard format export completed!")
    print(f"   Model graph: {output_path}")
    print(f"   Model data:  {data_path}")
    
    return output_path, data_path
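
The export above declares `batch_size` and `sequence_length` as dynamic axes. A minimal sketch of how one might confirm that the exported graph really accepts varying shapes is shown below. The helper `collect_output_shapes` and the stand-in `fake_run` are hypothetical names introduced here for illustration; the stand-in's output shapes are assumptions, not measurements from the real model.

```python
import numpy as np


def collect_output_shapes(run, shapes, seed=0):
    """Feed random token IDs of several (batch, seq_len) shapes through `run`.

    `run` maps a feed dict to a list of output arrays, for example
    `lambda feeds: session.run(None, feeds)` with an ort.InferenceSession.
    """
    rng = np.random.default_rng(seed)
    observed = {}
    for batch, seq_len in shapes:
        feeds = {
            "input_ids": rng.integers(4, 1000, size=(batch, seq_len), dtype=np.int64),
            "attention_mask": np.ones((batch, seq_len), dtype=np.int64),
        }
        outputs = run(feeds)
        for out in outputs:
            # Every exported output declares batch_size as its first dynamic axis
            if out.shape[0] != batch:
                raise ValueError(f"batch axis mismatch for input {(batch, seq_len)}: {out.shape}")
        observed[(batch, seq_len)] = [out.shape for out in outputs]
    return observed


def fake_run(feeds):
    """Stand-in for a real session; these shapes are illustrative assumptions."""
    batch, seq_len = feeds["input_ids"].shape
    return [
        np.zeros((batch, 1024)),
        np.zeros((batch, seq_len, 1)),
        np.zeros((batch, seq_len, 1024)),
    ]


observed = collect_output_shapes(fake_run, [(1, 8), (3, 16)])
print(observed)
```

With a real model, `fake_run` would be replaced by a callable that wraps `session.run(None, feeds)`. A shape error raised by ONNX Runtime at that point would suggest that an axis was fixed during tracing.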

In [9]:
# Create onnx directory if it doesn't exist
onnx_dir = "./onnx"
os.makedirs(onnx_dir, exist_ok=True)
    
# Set export path
bge_m3_onnx_path = os.path.join(onnx_dir, "bge_m3_model.onnx")
bge_m3_data_path = os.path.join(onnx_dir, "bge_m3_model.onnx_data")

# Export model if it doesn't exist
if not os.path.exists(bge_m3_onnx_path) or not os.path.exists(bge_m3_data_path):
    bge_m3_onnx_path, bge_m3_data_path = export_bge_m3_standard_format(output_path=bge_m3_onnx_path)
else:
    print(f"Using existing ONNX model at {bge_m3_onnx_path}")

Loading BGE-M3 model from BAAI/bge-m3


Fetching 30 files: 100%|██████████| 30/30 [00:00<?, ?it/s]


Exporting model to ONNX...
Model exported to: ./onnx\bge_m3_model.onnx
Converting to external data format...
✅ Standard format export completed!
   Model graph: ./onnx\bge_m3_model.onnx
   Model data:  ./onnx\bge_m3_model.onnx_data


## 2. Converting BGE-M3 Tokenizer to ONNX

Next, we convert the BGE-M3 tokenizer from Hugging Face to ONNX using ONNX Extensions.

In [10]:
def export_tokenizer_to_onnx(model_name="BAAI/bge-m3", tokenizer_path="onnx/tokenizer.onnx"):
    """Export the tokenizer to ONNX format"""
    print(f"Loading tokenizer from {model_name}")
    hf_tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    if not os.path.exists(tokenizer_path):
        print(f"Generating ONNX tokenizer at {tokenizer_path}")
        tokenizer_model = gen_processing_models(hf_tokenizer, pre_kwargs={}, post_kwargs={})[0]
        with open(tokenizer_path, "wb") as f:
            f.write(tokenizer_model.SerializeToString())
        print(f"✅ Tokenizer exported to {tokenizer_path}")
    else:
        print(f"Using existing tokenizer at {tokenizer_path}")
    
    return hf_tokenizer, tokenizer_path
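
Before comparing embeddings, it can help to confirm that the ONNX tokenizer and the Hugging Face tokenizer produce the same token IDs, since any mismatch there would propagate to every model output. The sketch below is one possible check; `first_token_mismatch` is a hypothetical helper, and the ID lists are made-up values rather than real tokenizer output.

```python
def first_token_mismatch(expected_ids, actual_ids):
    """Return None if both ID sequences are equal, else (position, expected, actual).

    A missing entry (one sequence is shorter) is reported as None.
    """
    for pos, (exp, act) in enumerate(zip(expected_ids, actual_ids)):
        if int(exp) != int(act):
            return pos, int(exp), int(act)
    if len(expected_ids) != len(actual_ids):
        pos = min(len(expected_ids), len(actual_ids))
        exp = int(expected_ids[pos]) if pos < len(expected_ids) else None
        act = int(actual_ids[pos]) if pos < len(actual_ids) else None
        return pos, exp, act
    return None


# Made-up ID sequences for illustration only
same = first_token_mismatch([0, 10, 11, 2], [0, 10, 11, 2])
changed = first_token_mismatch([0, 10, 11, 2], [0, 10, 12, 2])
shorter = first_token_mismatch([0, 10, 11, 2], [0, 10, 11])
print(same, changed, shorter)
```

In practice, `expected_ids` could come from `hf_tokenizer(text)["input_ids"]` and `actual_ids` from the ordered output of the ONNX tokenizer session.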

In [11]:
# Export tokenizer
tokenizer_path = os.path.join(onnx_dir, "tokenizer.onnx")
hf_tokenizer, tokenizer_path = export_tokenizer_to_onnx(tokenizer_path=tokenizer_path)

Loading tokenizer from BAAI/bge-m3
Generating ONNX tokenizer at ./onnx\tokenizer.onnx
✅ Tokenizer exported to ./onnx\tokenizer.onnx


## 3. Comparing ONNX Implementation with FlagEmbedding

Now we compare the ONNX implementation with the original FlagEmbedding implementation to verify that their outputs match to within floating-point precision.

In [12]:
def convert_tokenizer_outputs(tokens, token_indices):
    """Convert tokenizer outputs to model input format"""
    # Pair tokens with their indices and sort by position
    token_pairs = list(zip(token_indices, tokens))
    token_pairs.sort()  # Sort by position (token_indices)
    
    # Get ordered tokens
    ordered_tokens = [pair[1] for pair in token_pairs]
    
    # Create input_ids and attention_mask
    input_ids = np.array([ordered_tokens], dtype=np.int64)
    attention_mask = np.ones_like(input_ids, dtype=np.int64)
    
    return input_ids, attention_mask
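
`convert_tokenizer_outputs` builds inputs for a single text, so the attention mask is all ones. Because the exported model has a dynamic batch axis, several texts could be batched by padding them to a common length. The following is a minimal sketch under two assumptions: padding goes on the right, and the pad token ID is 1 (the XLM-RoBERTa convention, worth confirming with `hf_tokenizer.pad_token_id`). `pad_batch` is a hypothetical helper, not part of the notebook.

```python
import numpy as np


def pad_batch(sequences, pad_id=1):
    """Right-pad token ID lists to a common length and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = np.full((len(sequences), max_len), pad_id, dtype=np.int64)
    attention_mask = np.zeros((len(sequences), max_len), dtype=np.int64)
    for row, seq in enumerate(sequences):
        input_ids[row, :len(seq)] = seq
        attention_mask[row, :len(seq)] = 1  # 1 marks real tokens, 0 marks padding
    return input_ids, attention_mask


# Made-up token IDs for two texts of different lengths
ids, mask = pad_batch([[0, 5, 6, 2], [0, 7, 2]])
print(ids)
print(mask)
```

Downstream post-processing would then need to use the mask row for each text, much as the single-text code does with `attention_mask[0, i]`.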

In [13]:
class OnnxBGEM3Embedder:
    """BGE-M3 embedder using ONNX tokenizer and model"""
    
    def __init__(self, tokenizer_path, model_path):
        """Initialize the embedder with ONNX tokenizer and model"""
        # Initialize tokenizer session
        sess_options = ort.SessionOptions()
        sess_options.register_custom_ops_library(get_library_path())
        self.tokenizer_session = ort.InferenceSession(
            tokenizer_path,
            sess_options=sess_options,
            providers=['CPUExecutionProvider']
        )
        
        # Initialize model session
        self.model_session = ort.InferenceSession(
            model_path,
            providers=['CPUExecutionProvider']
        )
        
        # Special token IDs for sparse weights filtering
        self.special_token_ids = {0, 1, 2, 3}  # XLM-RoBERTa special tokens: <s>, <pad>, </s>, <unk>
    
    def encode(self, text, return_dense=True, return_sparse=False, return_colbert_vecs=False):
        """Generate embeddings for the input text"""
        # Tokenize the input
        tokenizer_outputs = self.tokenizer_session.run(None, {"inputs": np.array([text])})
        tokens, _, token_indices = tokenizer_outputs
        
        # Convert to model input format
        input_ids, attention_mask = convert_tokenizer_outputs(tokens, token_indices)
        
        # Generate embeddings
        model_outputs = self.model_session.run(None, {
            "input_ids": input_ids,
            "attention_mask": attention_mask
        })
        
        # Process outputs
        result = {}
        
        # ONNX outputs: dense_embeddings, sparse_weights, colbert_vectors
        dense_embeddings, sparse_weights, colbert_vectors = model_outputs
        
        # Process dense embeddings
        if return_dense:
            result["dense_vecs"] = dense_embeddings[0]  # First dimension is batch
        
        # Process sparse weights
        if return_sparse:
            # Convert to dictionary format like FlagEmbedding
            sparse_dict = {}
            for i, token_id in enumerate(input_ids[0]):
                if attention_mask[0, i] == 1 and token_id not in self.special_token_ids:
                    # The sparse head may emit [batch, seq_len, 1]; squeeze to a Python float
                    weight = float(np.squeeze(sparse_weights[0, i]))
                    if weight > 0:
                        token_id_int = int(token_id)
                        sparse_dict[token_id_int] = max(sparse_dict.get(token_id_int, 0.0), weight)
            
            result["lexical_weights"] = sparse_dict
        
        # Process ColBERT vectors
        if return_colbert_vecs:
            # Convert to list format like FlagEmbedding
            colbert_list = []
            for i in range(colbert_vectors.shape[1]):  # Iterate over sequence length
                if attention_mask[0, i] == 1:  # Only include non-padding tokens
                    colbert_list.append(colbert_vectors[0, i])
            
            result["colbert_vecs"] = colbert_list
        
        return result
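
The sparse post-processing in `encode` keeps one weight per token ID: special tokens and padding are skipped, non-positive weights are dropped, and repeated tokens keep their maximum weight. A standalone sketch of that rule on toy data is shown below. `to_lexical_weights` is a hypothetical helper, and the trailing dimension of size 1 on the weights is an assumption about the model's output layout.

```python
import numpy as np


def to_lexical_weights(input_ids, weights, attention_mask, special_ids=frozenset({0, 1, 2, 3})):
    """Collapse per-position weights into {token_id: max_weight}."""
    # Accept either [seq_len] or [seq_len, 1] weights
    flat = np.asarray(weights, dtype=np.float32).reshape(len(input_ids))
    result = {}
    for token_id, weight, mask in zip(input_ids, flat, attention_mask):
        token_id = int(token_id)
        if mask != 1 or token_id in special_ids or weight <= 0:
            continue
        # A token that occurs several times keeps its largest weight
        result[token_id] = max(result.get(token_id, 0.0), float(weight))
    return result


# Toy sequence: token 10 appears twice, token 11 has zero weight, last slot is padding
toy_ids = [0, 10, 11, 10, 2, 1]
toy_weights = [[0.9], [0.25], [0.0], [0.5], [0.75], [0.125]]
toy_mask = [1, 1, 1, 1, 1, 0]
lexical = to_lexical_weights(toy_ids, toy_weights, toy_mask)
print(lexical)
```

Token 10 ends up with the larger of its two weights, while the special tokens, the zero-weight token, and the padded position are left out.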

In [16]:
def compare_embeddings(text):
    """Compare embeddings from ONNX vs original FlagEmbedding"""
    print(f"Comparing embeddings for: '{text}'")
    
    # Load original FlagEmbedding model
    flag_embedder = M3Embedder(
        model_name_or_path="BAAI/bge-m3",
        use_fp16=False,  # Match the FP32 precision used for the ONNX export
        normalize_embeddings=True
    )
    
    # Load our ONNX implementation
    onnx_embedder = OnnxBGEM3Embedder(
        tokenizer_path=os.path.join(onnx_dir, "tokenizer.onnx"),
        model_path=os.path.join(onnx_dir, "bge_m3_model.onnx")
    )
    
    # Generate embeddings with both implementations
    flag_outputs = flag_embedder.encode(
        text,
        return_dense=True,
        return_sparse=True,
        return_colbert_vecs=True
    )
    
    onnx_outputs = onnx_embedder.encode(
        text,
        return_dense=True,
        return_sparse=True,
        return_colbert_vecs=True
    )
    
    # Compare dense embeddings
    flag_dense = flag_outputs["dense_vecs"]
    onnx_dense = onnx_outputs["dense_vecs"]
    
    print("\n=== DENSE EMBEDDING COMPARISON ===")
    print(f"FlagEmbedding shape: {flag_dense.shape}, ONNX shape: {onnx_dense.shape}")
    print(f"First 10 values (Flag): {flag_dense[:10]}")
    print(f"First 10 values (ONNX): {onnx_dense[:10]}")
    
    # Compute cosine similarity for dense
    dense_similarity = np.dot(flag_dense, onnx_dense) / (
        np.linalg.norm(flag_dense) * np.linalg.norm(onnx_dense)
    )
    print(f"Dense cosine similarity: {dense_similarity:.10f}")
    dense_diff = np.abs(flag_dense - onnx_dense).max()
    print(f"Maximum element-wise difference: {dense_diff:.10f}")
    
    # Compare sparse weights
    flag_sparse = flag_outputs["lexical_weights"]
    onnx_sparse = onnx_outputs["lexical_weights"]
    
    print("\n=== SPARSE WEIGHTS COMPARISON ===")
    print(f"FlagEmbedding tokens: {len(flag_sparse)}, ONNX tokens: {len(onnx_sparse)}")
    
    # Compare top tokens
    flag_top = sorted(flag_sparse.items(), key=lambda x: x[1], reverse=True)[:5]
    onnx_top = sorted(onnx_sparse.items(), key=lambda x: x[1], reverse=True)[:5]
    
    print("Top 5 tokens (Flag):")
    for token_id, weight in flag_top:
        print(f"  {token_id}: {weight:.6f}")
        
    print("Top 5 tokens (ONNX):")
    for token_id, weight in onnx_top:
        print(f"  {token_id}: {weight:.6f}")
    
    # Compare ColBERT vectors
    flag_colbert = flag_outputs["colbert_vecs"]
    onnx_colbert = onnx_outputs["colbert_vecs"]
    
    print("\n=== COLBERT VECTORS COMPARISON ===")
    print(f"FlagEmbedding vectors: {len(flag_colbert)}, ONNX vectors: {len(onnx_colbert)}")
    
    if len(flag_colbert) > 0 and len(onnx_colbert) > 0:
        print(f"First vector dimension (Flag): {len(flag_colbert[0])}")
        print(f"First vector dimension (ONNX): {len(onnx_colbert[0])}")
        
        # Compare first vectors
        flag_first = flag_colbert[0]
        onnx_first = onnx_colbert[0]
        
        print(f"First vector, first 10 values (Flag): {flag_first[:10]}")
        print(f"First vector, first 10 values (ONNX): {onnx_first[:10]}")
        
        # Compute cosine similarity for first vectors
        colbert_similarity = np.dot(flag_first, onnx_first) / (
            np.linalg.norm(flag_first) * np.linalg.norm(onnx_first)
        )
        print(f"First vector cosine similarity: {colbert_similarity:.10f}")
    
    # Overall assessment
    if dense_similarity > 0.9999 and len(flag_sparse) == len(onnx_sparse):
        print("\n✅ CONCLUSION: The ONNX and FlagEmbedding outputs match very closely!")
    else:
        print("\n⚠️ CONCLUSION: There are some differences between the ONNX and FlagEmbedding outputs.")
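
`compare_embeddings` bases its verdict on the dense cosine similarity and the number of sparse tokens, and it inspects only the first ColBERT vector. A stricter check could compare every sparse weight and every ColBERT vector. The sketch below shows one way to do that; both helpers are hypothetical, and the `int()` normalization of keys is a precaution in case one implementation uses string token IDs.

```python
import numpy as np


def max_sparse_diff(a, b):
    """Largest absolute weight difference over the union of token IDs."""
    a = {int(k): float(v) for k, v in a.items()}
    b = {int(k): float(v) for k, v in b.items()}
    keys = set(a) | set(b)
    # A token missing on one side counts as weight 0.0
    return max((abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in keys), default=0.0)


def min_rowwise_cosine(x, y):
    """Smallest cosine similarity between corresponding rows of two matrices."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    if x.shape != y.shape:
        raise ValueError(f"shape mismatch: {x.shape} vs {y.shape}")
    dots = (x * y).sum(axis=1)
    norms = np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1)
    return float((dots / norms).min())


# Toy inputs for illustration only
sparse_gap = max_sparse_diff({"10": 0.5, "11": 0.25}, {10: 0.5, 11: 0.125})
worst_cosine = min_rowwise_cosine([[1.0, 0.0], [0.0, 2.0]], [[2.0, 0.0], [0.0, 1.0]])
print(sparse_gap, worst_cosine)
```

With real outputs, the two ColBERT results could be passed as `np.stack(flag_colbert)` and `np.stack(onnx_colbert)` once their lengths are known to agree.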

In [17]:
# Test with a diverse multilingual text
test_text = "A test text! Texto de prueba! Текст для теста! 測試文字! Testtext! Testez le texte! Сынақ мәтіні! Тестни текст! परीक्षण पाठ! Kiểm tra văn bản!"
compare_embeddings(test_text)

Comparing embeddings for: 'A test text! Texto de prueba! Текст для теста! 測試文字! Testtext! Testez le texte! Сынақ мәтіні! Тестни текст! परीक्षण पाठ! Kiểm tra văn bản!'


Fetching 30 files: 100%|██████████| 30/30 [00:00<?, ?it/s]
You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.



=== DENSE EMBEDDING COMPARISON ===
FlagEmbedding shape: (1024,), ONNX shape: (1024,)
First 10 values (Flag): [-0.00892847  0.02104794 -0.01595518 -0.03338685  0.00300004 -0.03306345
  0.02058692  0.05966044  0.01986757 -0.01925989]
First 10 values (ONNX): [-0.00892847  0.02104793 -0.0159552  -0.0333869   0.003      -0.03306337
  0.02058701  0.05966051  0.01986761 -0.01925998]
Dense cosine similarity: 1.0000000000
Maximum element-wise difference: 0.0000002235

=== SPARSE WEIGHTS COMPARISON ===
FlagEmbedding tokens: 34, ONNX tokens: 34
Top 5 tokens (Flag):
  3034: 0.335953
  7986: 0.329017
  90871: 0.307428
  15930: 0.300079
  22829: 0.298626
Top 5 tokens (ONNX):
  3034: 0.335953
  7986: 0.329017
  90871: 0.307428
  15930: 0.300079
  22829: 0.298626

=== COLBERT VECTORS COMPARISON ===
FlagEmbedding vectors: 45, ONNX vectors: 45
First vector dimension (Flag): 1024
First vector dimension (ONNX): 1024
First vector, first 10 values (Flag): [ 0.0267088   0.00919296 -0.00589903  0.03964164  0

## Conclusions

This work achieved the following:

1. **Complete BGE-M3 Model Conversion**: We converted the BGE-M3 model from FlagEmbedding to ONNX format, preserving all three output types (dense, sparse, and ColBERT vectors).

2. **Tokenizer Conversion**: We converted the BGE-M3 tokenizer from Hugging Face to ONNX format using ONNX Extensions, so text preprocessing can run in any language that has ONNX Runtime and the extensions library.

3. **Matching Outputs**: On the multilingual test sentence, the ONNX dense embedding differs from the FlagEmbedding one by at most about 2.2e-7 per element (cosine similarity 1.0 to ten decimal places). Both implementations produce 34 sparse tokens whose top five weights agree to six decimal places, and both produce 45 ColBERT vectors. The remaining differences are consistent with floating-point precision.

These conversions make BGE-M3 available for use in cross-platform applications, particularly in languages like C# and Java, while maintaining the full functionality and accuracy of the original Python implementation.
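
The floating-point agreement reported above can be expressed as an explicit tolerance check. The sketch below reuses the first ten dense values printed in the comparison output. These are rounded printouts, so the computed gap is only indicative, and the tolerance of `1e-6` is an assumed threshold rather than one stated in the notebook.

```python
import numpy as np

# First ten dense values as printed in the comparison output (rounded)
flag_vals = np.array([
    -0.00892847, 0.02104794, -0.01595518, -0.03338685, 0.00300004,
    -0.03306345, 0.02058692, 0.05966044, 0.01986757, -0.01925989,
])
onnx_vals = np.array([
    -0.00892847, 0.02104793, -0.0159552, -0.0333869, 0.003,
    -0.03306337, 0.02058701, 0.05966051, 0.01986761, -0.01925998,
])

# Largest element-wise gap and an absolute-tolerance check
max_diff = float(np.abs(flag_vals - onnx_vals).max())
within_tolerance = bool(np.allclose(flag_vals, onnx_vals, rtol=0.0, atol=1e-6))
print(f"max diff: {max_diff:.2e}, within 1e-6: {within_tolerance}")
```

Using an absolute tolerance with `rtol=0.0` keeps the check easy to reason about for normalized embeddings, whose components are all small in magnitude.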