# E5 Large Instruct ONNX Conversion Research

This notebook documents the conversion of the `intfloat/multilingual-e5-large-instruct` model to ONNX format so that it can be used across platforms, with results that match the original HuggingFace implementation to within float32 rounding error.

## Key Features:
1. Conversion of the E5 Large Instruct model from HuggingFace to ONNX
2. Conversion of the tokenizer to ONNX using ONNX Runtime Extensions (`onnxruntime-extensions`)
3. Implementation of average pooling and L2 normalization inside the exported graph
4. Verification that the ONNX and HuggingFace implementations produce matching embeddings

## Import Required Libraries

In [None]:
%pip install onnx onnxruntime onnxruntime-extensions numpy torch transformers

In [17]:
import onnx
import onnxruntime as ort
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from onnxruntime_extensions import gen_processing_models, get_library_path
from transformers import AutoTokenizer, AutoModel
import os

## 1. E5 Model Implementation and ONNX Wrapper

In [18]:
def average_pool(last_hidden_states, attention_mask):
    """Average pooling implementation matching the original"""
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

def get_detailed_instruct(task_description: str, query: str) -> str:
    """Format query with instruction as required by E5"""
    return f'Instruct: {task_description}\nQuery: {query}'

class E5LargeInstructONNXWrapper(nn.Module):
    """Wrapper class to make E5 Large Instruct compatible with ONNX export"""
    def __init__(self, model):
        super().__init__()
        self.model = model
        
    def forward(self, input_ids, attention_mask):
        # Get model outputs
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
        
        # Apply average pooling
        embeddings = average_pool(outputs.last_hidden_state, attention_mask)
        
        # Apply L2 normalization
        normalized_embeddings = F.normalize(embeddings, p=2, dim=1)
        
        return normalized_embeddings
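
To make the pooling and normalization steps concrete, here is a minimal sketch in plain Python (no PyTorch) for a single toy sequence of three token vectors, where the last position is padding. The helper names `masked_mean` and `l2_normalize` and the toy values are illustrative assumptions, not part of the notebook. The arithmetic mirrors `average_pool` followed by `F.normalize(..., p=2)`, ignoring the small epsilon clamp that `F.normalize` applies to the norm.

```python
import math

def masked_mean(hidden, mask):
    """Average token vectors where mask == 1; padded positions contribute nothing."""
    dim = len(hidden[0])
    kept = sum(mask)  # number of real (non-padding) tokens
    return [
        sum(vec[d] for vec, m in zip(hidden, mask) if m) / kept
        for d in range(dim)
    ]

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Toy "last_hidden_state" for one sequence: 3 positions, hidden size 2.
hidden = [[1.0, 2.0], [5.0, 6.0], [100.0, 100.0]]  # last row is padding
mask = [1, 1, 0]

pooled = masked_mean(hidden, mask)   # mean of the first two rows only
embedding = l2_normalize(pooled)     # unit-length version of the pooled vector
print(pooled, embedding)
```

The padded row, despite its large values, has no effect on the result because the mask excludes it from both the sum and the token count.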

## 2. Export E5 Large Instruct Model to ONNX

In [19]:
def export_e5_large_instruct_model(model_name_or_path="intfloat/multilingual-e5-large-instruct", output_path="e5_large_instruct_model.onnx"):
    """Export E5 Large Instruct model to ONNX format"""
    print(f"Loading E5 Large Instruct model from {model_name_or_path}")
    
    # Load the model
    model = AutoModel.from_pretrained(model_name_or_path)
    
    # Wrap the model
    onnx_model = E5LargeInstructONNXWrapper(model)
    onnx_model.eval()
    
    # Create dummy input
    dummy_input_ids = torch.randint(0, 1000, (1, 512), dtype=torch.long)
    dummy_attention_mask = torch.ones(1, 512, dtype=torch.long)
    
    print("Exporting model to ONNX...")
    
    # Export to ONNX
    torch.onnx.export(
        onnx_model,
        (dummy_input_ids, dummy_attention_mask),
        output_path,
        input_names=['input_ids', 'attention_mask'],
        output_names=['embeddings'],
        dynamic_axes={
            'input_ids': {0: 'batch_size', 1: 'sequence_length'},
            'attention_mask': {0: 'batch_size', 1: 'sequence_length'},
            'embeddings': {0: 'batch_size'}
        },
        opset_version=20,
        export_params=True
    )
    
    print(f"Model exported to: {output_path}")
    
    # Load the exported model and save with external data format
    print("Converting to external data format...")
    
    model = onnx.load(output_path)
    
    # Get directory and filename parts
    output_dir = os.path.dirname(output_path)
    base_filename = os.path.basename(output_path)
    data_filename = base_filename.replace('.onnx', '.onnx_data')
    data_path = os.path.join(output_dir, data_filename)
    
    onnx.save_model(
        model, 
        output_path,
        save_as_external_data=True,
        all_tensors_to_one_file=True,
        location=data_filename
    )
    
    print("✅ External data format export completed!")
    print(f"   Model graph: {output_path}")
    print(f"   Model data:  {data_path}")
    
    return output_path, data_path
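
The export stores the weights in a separate file next to the graph, and the `location` passed to `onnx.save_model` is interpreted relative to the directory that contains the `.onnx` file. The sketch below shows one way the two paths could be derived. `external_data_paths` is a hypothetical helper (not in the notebook) that uses `os.path.splitext` instead of `str.replace`, so only the final extension is rewritten and a path without a directory component is handled too.

```python
import os

def external_data_paths(output_path):
    """Return (data_filename, data_path) for a given .onnx graph path.

    data_filename is what would be passed as `location`; data_path is
    where the weights end up on disk (same directory as the graph).
    """
    output_dir, base_filename = os.path.split(output_path)
    stem, ext = os.path.splitext(base_filename)
    if ext != ".onnx":
        raise ValueError(f"expected a .onnx path, got {output_path!r}")
    data_filename = stem + ".onnx_data"
    return data_filename, os.path.join(output_dir, data_filename)

print(external_data_paths("onnx/e5_large_instruct_model.onnx"))
print(external_data_paths("model.onnx"))
```

Because the reference is relative, the graph file and its data file need to be kept in the same directory when the model is copied or deployed.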

## 3. Export E5 Tokenizer to ONNX

In [21]:
def export_e5_tokenizer_to_onnx(model_name="intfloat/multilingual-e5-large-instruct", tokenizer_path="onnx/e5_large_instruct_tokenizer.onnx"):
    """Export the E5 tokenizer to ONNX format"""
    print(f"Loading tokenizer from {model_name}")
    hf_tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    if not os.path.exists(tokenizer_path):
        print(f"Generating ONNX tokenizer at {tokenizer_path}")
        tokenizer_model = gen_processing_models(hf_tokenizer, pre_kwargs={}, post_kwargs={})[0]
        
        # Ensure the output directory exists (dirname is empty for a bare filename)
        out_dir = os.path.dirname(tokenizer_path)
        if out_dir:
            os.makedirs(out_dir, exist_ok=True)
        
        with open(tokenizer_path, "wb") as f:
            f.write(tokenizer_model.SerializeToString())
        print(f"✅ Tokenizer exported to {tokenizer_path}")
    else:
        print(f"Using existing tokenizer at {tokenizer_path}")
    
    return hf_tokenizer, tokenizer_path

## 4. ONNX E5 Implementation

In [22]:
def convert_tokenizer_outputs(tokens, token_indices):
    """Convert tokenizer outputs to model input format"""
    # Pair tokens with their indices and sort by position
    token_pairs = list(zip(token_indices, tokens))
    token_pairs.sort()  # Sort by position (token_indices)
    
    # Get ordered tokens
    ordered_tokens = [pair[1] for pair in token_pairs]
    
    # Create input_ids and attention_mask
    input_ids = np.array([ordered_tokens], dtype=np.int64)
    attention_mask = np.ones_like(input_ids, dtype=np.int64)
    
    return input_ids, attention_mask

class OnnxE5LargeInstructEmbedder:
    """E5 Large Instruct embedder using ONNX tokenizer and model"""
    
    def __init__(self, tokenizer_path, model_path):
        """Initialize the embedder with ONNX tokenizer and model"""
        # Initialize tokenizer session
        sess_options = ort.SessionOptions()
        sess_options.register_custom_ops_library(get_library_path())
        self.tokenizer_session = ort.InferenceSession(
            tokenizer_path,
            sess_options=sess_options,
            providers=['CPUExecutionProvider']
        )
        
        # Initialize model session
        self.model_session = ort.InferenceSession(
            model_path,
            providers=['CPUExecutionProvider']
        )
    
    @staticmethod
    def get_detailed_instruct(task_description: str, query: str) -> str:
        """Format query with instruction as required by E5"""
        return f'Instruct: {task_description}\nQuery: {query}'
    
    def encode(self, texts):
        """Generate embeddings for the input texts.

        Returns a 1-D array for a single string and a 2-D array
        (one row per text) for a list of strings.
        """
        single_input = isinstance(texts, str)
        if single_input:
            texts = [texts]
        
        embeddings = []
        
        for text in texts:
            # Tokenize the input (one text per call, so no padding is needed)
            tokenizer_outputs = self.tokenizer_session.run(None, {"inputs": np.array([text])})
            tokens, _, token_indices = tokenizer_outputs
            
            # Convert to model input format
            input_ids, attention_mask = convert_tokenizer_outputs(tokens, token_indices)
            
            # Generate embeddings
            model_outputs = self.model_session.run(None, {
                "input_ids": input_ids,
                "attention_mask": attention_mask
            })
            
            # Extract normalized embeddings
            embedding = model_outputs[0][0]  # Remove batch dimension
            embeddings.append(embedding)
        
        return embeddings[0] if single_input else np.array(embeddings)
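
`encode` runs the model once per text, which avoids padding altogether. Because the graph was exported with dynamic batch and sequence axes, several texts could in principle be sent in one call if their token lists are first padded to a common length. The sketch below shows only that padding step in plain Python. `pad_batch` is a hypothetical helper, the token ids are made up for illustration, and `pad_id=1` assumes the XLM-RoBERTa `<pad>` id; check `hf_tokenizer.pad_token_id` before relying on it.

```python
def pad_batch(token_lists, pad_id=1):
    """Right-pad token id lists to equal length and build the attention mask.

    pad_id=1 is an assumption (XLM-RoBERTa's <pad>); verify it against the
    tokenizer actually in use.
    """
    max_len = max(len(ids) for ids in token_lists)
    input_ids, attention_mask = [], []
    for ids in token_lists:
        n_pad = max_len - len(ids)
        input_ids.append(list(ids) + [pad_id] * n_pad)        # pad on the right
        attention_mask.append([1] * len(ids) + [0] * n_pad)   # 0 marks padding
    return input_ids, attention_mask

# Illustrative ids only, not real tokenizer output.
ids, mask = pad_batch([[0, 35378, 2], [0, 6, 5, 7, 2]])
print(ids)
print(mask)
```

Wrapping both lists in `np.array(..., dtype=np.int64)` would give arrays in the layout the model session takes as input. The masked average pooling in the exported graph is what keeps padded positions out of the pooled embedding, though a batched run may differ from the per-text run by small floating-point amounts.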

## 5. Export Models and Compare Implementations

In [None]:
# Create the onnx directory if it doesn't exist
onnx_dir = "./onnx"
os.makedirs(onnx_dir, exist_ok=True)

# Export tokenizer
tokenizer_path = os.path.join(onnx_dir, "e5_large_instruct_tokenizer.onnx")
hf_tokenizer, tokenizer_path = export_e5_tokenizer_to_onnx(tokenizer_path=tokenizer_path)

# Export model
model_path = os.path.join(onnx_dir, "e5_large_instruct_model.onnx")
data_path = os.path.join(onnx_dir, "e5_large_instruct_model.onnx_data")

if not os.path.exists(model_path) or not os.path.exists(data_path):
    model_path, data_path = export_e5_large_instruct_model(output_path=model_path)
else:
    print(f"Using existing ONNX model at {model_path}")

In [23]:
def compare_e5_embeddings():
    """Compare embeddings from ONNX vs original HuggingFace implementation"""
    print("Comparing E5 Large Instruct embeddings...")
    
    # Test data from the original example
    task = 'Given a web search query, retrieve relevant passages that answer the query'
    queries = [
        'how much protein should a female eat',
        '南瓜的家常做法'
    ]
    documents = [
        "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
        "1.清炒南瓜丝 原料:嫩南瓜半个 调料:葱、盐、白糖、鸡精 做法: 1、南瓜用刀薄薄的削去表面一层皮,用勺子刮去瓤 2、擦成细丝(没有擦菜板就用刀慢慢切成细丝) 3、锅烧热放油,入葱花煸出香味 4、入南瓜丝快速翻炒一分钟左右,放盐、一点白糖和鸡精调味出锅 2.香葱炒南瓜 原料:南瓜1只 调料:香葱、蒜末、橄榄油、盐 做法: 1、将南瓜去皮,切成片 2、油锅8成热后,将蒜末放入爆香 3、爆香后,将南瓜片放入,翻炒 4、在翻炒的同时,可以不时地往锅里加水,但不要太多 5、放入盐,炒匀 6、南瓜差不多软和绵了之后,就可以关火 7、撒入香葱,即可出锅"
    ]
    
    # Prepare input texts
    instruct_queries = [get_detailed_instruct(task, query) for query in queries]
    input_texts = instruct_queries + documents
    
    # Original HuggingFace implementation
    tokenizer = AutoTokenizer.from_pretrained('intfloat/multilingual-e5-large-instruct')
    model = AutoModel.from_pretrained('intfloat/multilingual-e5-large-instruct')
    
    batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():  # inference only, so no autograd graph is needed
        outputs = model(**batch_dict)
    hf_embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
    hf_embeddings = F.normalize(hf_embeddings, p=2, dim=1)
    
    # ONNX implementation
    onnx_embedder = OnnxE5LargeInstructEmbedder(
        tokenizer_path=os.path.join(onnx_dir, "e5_large_instruct_tokenizer.onnx"),
        model_path=os.path.join(onnx_dir, "e5_large_instruct_model.onnx")
    )
    
    onnx_embeddings = onnx_embedder.encode(input_texts)
    
    # Compare embeddings
    print("\n=== COMPARISON ===")
    for i, _ in enumerate(input_texts):
        hf_emb = hf_embeddings[i].detach().numpy()
        onnx_emb = onnx_embeddings[i]
        
        similarity = np.dot(hf_emb, onnx_emb) / (np.linalg.norm(hf_emb) * np.linalg.norm(onnx_emb))
        max_diff = np.abs(hf_emb - onnx_emb).max()
        
        print(f"Text {i+1}: Similarity={similarity:.10f}, Max diff={max_diff:.10f}")
        print(f"  First 5 values (HF):   {hf_emb[:5]}")
        print(f"  First 5 values (ONNX): {onnx_emb[:5]}")
    
    # Calculate similarity scores like in the original example
    print("\n=== SIMILARITY SCORES ===")
    hf_scores = (hf_embeddings[:2] @ hf_embeddings[2:].T) * 100
    onnx_scores = (onnx_embeddings[:2] @ onnx_embeddings[2:].T) * 100
    
    print(f"HuggingFace scores: {hf_scores.tolist()}")
    print(f"ONNX scores:        {onnx_scores.tolist()}")
    
    # Check whether the scores are close (convert the torch tensor to NumPy first)
    score_diff = np.abs(hf_scores.detach().numpy() - onnx_scores).max()
    print(f"Maximum score difference: {score_diff:.6f}")
    
    if score_diff < 0.01:
        print("\n✅ CONCLUSION: The ONNX and HuggingFace outputs match very closely!")
    else:
        print("\n⚠️ CONCLUSION: There are some differences between the ONNX and HuggingFace outputs.")

compare_e5_embeddings()

Comparing E5 Large Instruct embeddings...

=== COMPARISON ===
Text 1: Similarity=1.0000002384, Max diff=0.0000002310
  First 5 values (HF):   [ 0.02018587  0.01120439 -0.04514243 -0.03949937  0.01756799]
  First 5 values (ONNX): [ 0.02018598  0.01120445 -0.04514245 -0.03949942  0.01756793]
Text 2: Similarity=1.0000000000, Max diff=0.0000002887
  First 5 values (HF):   [ 0.04857631  0.05938286  0.00398631 -0.02070856  0.02414943]
  First 5 values (ONNX): [ 0.04857632  0.05938292  0.00398623 -0.02070858  0.02414938]
Text 3: Similarity=1.0000000000, Max diff=0.0000002738
  First 5 values (HF):   [ 0.0326695   0.00413252 -0.05025513 -0.02161583  0.03252118]
  First 5 values (ONNX): [ 0.03266948  0.00413257 -0.05025515 -0.02161592  0.03252107]
Text 4: Similarity=1.0000000000, Max diff=0.0000004210
  First 5 values (HF):   [ 0.04768918  0.05657836  0.00922346 -0.0189132   0.01720553]
  First 5 values (ONNX): [ 0.04768923  0.05657855  0.00922368 -0.01891332  0.01720552]

=== SIMILARITY SCORES ===
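
The two metrics reported in the comparison output can be reproduced by hand. The sketch below recomputes them in plain Python for the first five components of Text 1 exactly as printed above. This is only a five-component slice of the full embedding, so the numbers illustrate the calculation and will not equal the full-vector figures. The function names are illustrative.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def max_abs_diff(a, b):
    """Largest component-wise absolute difference."""
    return max(abs(x - y) for x, y in zip(a, b))

# First five values of Text 1, copied from the output above.
hf_first5 = [0.02018587, 0.01120439, -0.04514243, -0.03949937, 0.01756799]
onnx_first5 = [0.02018598, 0.01120445, -0.04514245, -0.03949942, 0.01756793]

print(f"cosine={cosine_similarity(hf_first5, onnx_first5):.10f}")
print(f"max diff={max_abs_diff(hf_first5, onnx_first5):.2e}")
```

A reported similarity slightly above 1, as for Text 1, is a float32 rounding effect in the dot product and norms rather than a sign of a mismatch.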

## Conclusions

This research achieved the following:

1. **Complete E5 Large Instruct Model Conversion**: The model was converted from HuggingFace to ONNX format, with average pooling and L2 normalization built into the exported graph.

2. **Tokenizer Conversion**: The tokenizer was converted from HuggingFace to ONNX format using ONNX Runtime Extensions.

3. **Instruction Formatting**: The `Instruct: ...\nQuery: ...` format that E5 requires for queries is preserved through `get_detailed_instruct`.

4. **Matching Behavior**: On the four test texts, the ONNX embeddings differ from the HuggingFace embeddings by less than 5e-7 in any component, which is within float32 rounding error.

This conversion enables E5 Large Instruct to be used in cross-platform applications, particularly from C# and Java, with embeddings that match the original implementation.