# HTP Export to Optimum Inference Workflow

This notebook demonstrates the complete workflow:
1. Export a model using HTP exporter CLI
2. Copy HuggingFace config alongside ONNX model
3. Use Optimum ORTModel for inference

## Prerequisites
- modelexport package installed
- transformers and optimum packages
- ONNX Runtime
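
One way to confirm these prerequisites before running any cell is to query the installed distributions with the standard library. The sketch below is illustrative only: `check_packages` is a hypothetical helper, and the distribution names are assumed to match how the packages were installed in your environment.

```python
from importlib import metadata


def check_packages(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names are assumptions; adjust them to your environment.
required = ["modelexport", "transformers", "optimum", "onnxruntime"]
for name, version in check_packages(required).items():
    print(f"{name}: {version if version else 'NOT INSTALLED'}")
```

A `None` entry means the distribution metadata was not found, which usually indicates the package still needs to be installed.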

## Step 1: Setup and Imports

In [None]:
import os
import sys
import json
import shutil
from pathlib import Path
import subprocess

# Add project root to path
project_root = Path().absolute().parent.parent.parent
sys.path.insert(0, str(project_root))

print(f"Project root: {project_root}")
print(f"Python path includes: {sys.path[0]}")

In [None]:
# Import required libraries
from transformers import AutoConfig, AutoTokenizer, AutoModel
import torch
import numpy as np

# Check available packages
try:
    from optimum.onnxruntime import ORTModel, ORTModelForFeatureExtraction
    print("✅ Optimum ONNX Runtime support available")
except ImportError:
    print("❌ Optimum not installed. Install with: pip install optimum[onnxruntime]")

try:
    import onnxruntime
    print(f"✅ ONNX Runtime version: {onnxruntime.__version__}")
except ImportError:
    print("❌ ONNX Runtime not installed. Install with: pip install onnxruntime")

## Step 2: Export BERT-tiny using HTP Exporter CLI

In [None]:
# Configuration
MODEL_NAME = "prajjwal1/bert-tiny"
EXPORT_DIR = project_root / "temp" / "bert-tiny-htp-export"
ONNX_FILE = EXPORT_DIR / "model.onnx"

# Create export directory
EXPORT_DIR.mkdir(parents=True, exist_ok=True)
print(f"Export directory: {EXPORT_DIR}")

In [None]:
# Export using HTP exporter CLI
export_command = [
    sys.executable, "-m", "modelexport",
    "export",
    MODEL_NAME,
    str(ONNX_FILE),
    "--strategy", "htp",
    "--opset", "14",
    "--verbose"
]

print("🚀 Running HTP export command:")
print(" ".join(export_command))
print("\n" + "="*60)

try:
    result = subprocess.run(
        export_command,
        capture_output=True,
        text=True,
        cwd=project_root
    )
    
    if result.returncode == 0:
        print("✅ Export successful!")
        print("\nOutput (last 20 lines):")
        print("\n".join(result.stdout.split("\n")[-20:]))
    else:
        print("❌ Export failed!")
        print("Error output:")
        print(result.stderr)
except Exception as e:
    print(f"❌ Error running export: {e}")
    print("\nAlternative: You can run this command manually:")
    print(f"cd {project_root}")
    print(" ".join(export_command))
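
The subprocess pattern in the cell above can be wrapped in a small reusable helper. This is a minimal sketch, not part of modelexport: `run_cli` is a hypothetical name and the default timeout is an assumption. It accepts any command list, including the `export_command` built above.

```python
import subprocess
import sys


def run_cli(command, cwd=None, timeout=600, tail_lines=20):
    """Run a command and return (succeeded, tail of its output).

    On success the tail comes from stdout; on failure it comes from stderr.
    """
    try:
        result = subprocess.run(
            command,
            capture_output=True,
            text=True,
            cwd=cwd,
            timeout=timeout,  # assumed limit; large models may need longer
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # The command could not be started or did not finish in time.
        return False, str(exc)

    stream = result.stdout if result.returncode == 0 else result.stderr
    tail = "\n".join(stream.splitlines()[-tail_lines:])
    return result.returncode == 0, tail


# Demonstration with a harmless command instead of the real export.
ok, output = run_cli([sys.executable, "-c", "print('hello from subprocess')"])
print(ok, output)
```

Returning a flag instead of raising keeps the notebook running, so a later cell can decide whether to fall back to another export path.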

In [None]:
# Verify ONNX file was created
if ONNX_FILE.exists():
    file_size = ONNX_FILE.stat().st_size / (1024 * 1024)
    print(f"✅ ONNX model exported: {ONNX_FILE}")
    print(f"   File size: {file_size:.2f} MB")
else:
    print(f"❌ ONNX file not found at {ONNX_FILE}")
    print("\nTrying alternative export method...")
    
    # Alternative: Direct Python export
    from modelexport.conversion.hf_universal_hierarchy_exporter import HuggingFaceUniversalHierarchyExporter
    
    exporter = HuggingFaceUniversalHierarchyExporter(
        model_name=MODEL_NAME,
        output_path=str(ONNX_FILE),
        strategy="htp",
        opset_version=14
    )
    
    success = exporter.export()
    if success:
        print("✅ Alternative export successful!")
    else:
        print("❌ Alternative export also failed")

## Step 3: Copy HuggingFace Config Alongside ONNX Model

In [None]:
# Load and save HuggingFace config
print(f"📋 Loading HuggingFace config for {MODEL_NAME}...")

# Load the config from HuggingFace
hf_config = AutoConfig.from_pretrained(MODEL_NAME)

print(f"Model type: {hf_config.model_type}")
print(f"Hidden size: {hf_config.hidden_size}")
print(f"Num layers: {hf_config.num_hidden_layers}")
print(f"Vocab size: {hf_config.vocab_size}")

# Save config.json in the export directory
config_path = EXPORT_DIR / "config.json"
hf_config.save_pretrained(EXPORT_DIR)

print(f"\n✅ Config saved to: {config_path}")

In [None]:
# Also save the tokenizer for convenience
print("📝 Saving tokenizer...")

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.save_pretrained(EXPORT_DIR)

print(f"✅ Tokenizer saved to: {EXPORT_DIR}")

# List all files in export directory
print("\n📁 Files in export directory:")
for file in EXPORT_DIR.iterdir():
    if file.is_file():
        size = file.stat().st_size / 1024
        print(f"  - {file.name}: {size:.1f} KB")
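
Before loading the model, it can help to confirm that the export directory has the layout this workflow relies on. The sketch below assumes the ONNX file is named `model.onnx` (as in `ONNX_FILE` above) and that `config.json` sits next to it. `check_export_dir` is a hypothetical helper, not an Optimum or modelexport API.

```python
import json
import tempfile
from pathlib import Path


def check_export_dir(export_dir, required=("model.onnx", "config.json")):
    """Return (missing_files, model_type) for an export directory.

    model_type is read from config.json when that file exists, else None.
    """
    export_dir = Path(export_dir)
    missing = [name for name in required if not (export_dir / name).is_file()]

    model_type = None
    config_file = export_dir / "config.json"
    if config_file.is_file():
        with config_file.open(encoding="utf-8") as f:
            model_type = json.load(f).get("model_type")
    return missing, model_type


# Demonstration on a throwaway directory with a placeholder config.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "config.json").write_text(
        json.dumps({"model_type": "bert"}), encoding="utf-8"
    )
    print(check_export_dir(tmp_dir))  # model.onnx is reported as missing
```

Calling `check_export_dir(EXPORT_DIR)` should return an empty list of missing files once the export and config steps have both succeeded.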

## Step 4: Load ONNX Model with Optimum ORTModel

In [None]:
# Load the ONNX model using Optimum
print("🔧 Loading ONNX model with Optimum...")

try:
    from optimum.onnxruntime import ORTModelForFeatureExtraction
    
    # Load the model - it will use config.json automatically
    ort_model = ORTModelForFeatureExtraction.from_pretrained(EXPORT_DIR)
    
    print("✅ Model loaded successfully!")
    print(f"Model type: {type(ort_model)}")
    print(f"Config type: {type(ort_model.config)}")
    print(f"Config model type: {ort_model.config.model_type}")
    
except Exception as e:
    print(f"❌ Error loading with Optimum: {e}")
    print("\nTrying direct ONNX Runtime loading...")
    
    # Fallback: Direct ONNX Runtime
    import onnxruntime as ort
    
    session = ort.InferenceSession(str(ONNX_FILE))
    print("✅ Loaded with ONNX Runtime directly")
    print(f"Input names: {[inp.name for inp in session.get_inputs()]}")
    print(f"Output names: {[out.name for out in session.get_outputs()]}")

## Step 5: Inference with Sample Text Input

In [None]:
# Define a few sample test texts
test_texts = [
    "The quick brown fox jumps over the lazy dog.",
    "ONNX models provide efficient inference.",
    "HuggingFace Optimum makes deployment easy.",
    "This is a test of the BERT-tiny model exported with HTP.",
    "Machine learning models can be optimized for production."
]

print("📝 Test texts:")
for i, text in enumerate(test_texts, 1):
    print(f"{i}. {text}")

In [None]:
# Tokenize the texts
print("\n🔤 Tokenizing inputs...")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(EXPORT_DIR)

# Tokenize all texts
inputs = tokenizer(
    test_texts,
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="pt"  # Return PyTorch tensors
)

print(f"Input shape: {inputs['input_ids'].shape}")
print(f"Input keys: {list(inputs.keys())}")
print(f"\nFirst sequence tokens (first 20):")
print(inputs['input_ids'][0][:20].tolist())
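
To make `padding=True`, `truncation=True`, and `max_length` concrete, the following pure-Python sketch pads a batch of token ID lists to a common length and builds the matching attention mask. It is an illustration only, not the tokenizer's implementation: `pad_batch` is a hypothetical helper, the token IDs are made up, a pad ID of 0 is assumed, and the special-token handling that real tokenizers apply during truncation is ignored.

```python
def pad_batch(sequences, max_length=128, pad_id=0):
    """Truncate to max_length, then right-pad to the longest remaining sequence."""
    truncated = [seq[:max_length] for seq in sequences]
    width = max((len(seq) for seq in truncated), default=0)

    input_ids, attention_mask = [], []
    for seq in truncated:
        padding = width - len(seq)
        input_ids.append(seq + [pad_id] * padding)
        # 1 marks a real token, 0 marks padding that the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * padding)
    return input_ids, attention_mask


ids, mask = pad_batch([[101, 7592, 102], [101, 102]], max_length=4)
print(ids)   # [[101, 7592, 102], [101, 102, 0]]
print(mask)  # [[1, 1, 1], [1, 1, 0]]
```

The attention mask produced here plays the same role as `inputs['attention_mask']` in the cell above: it tells the model which positions hold real tokens.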

In [None]:
# Run inference with Optimum ORTModel
print("\n🚀 Running inference with Optimum ORTModel...")

try:
    # Run inference
    with torch.no_grad():
        outputs = ort_model(**inputs)
    
    print("✅ Inference successful!")
    
    # Check output structure
    if hasattr(outputs, 'last_hidden_state'):
        hidden_states = outputs.last_hidden_state
        print(f"\nOutput shape: {hidden_states.shape}")
        print(f"Output type: {type(hidden_states)}")
        
        # Get sentence embeddings (mean pooling)
        attention_mask = inputs['attention_mask']
        mask_expanded = attention_mask.unsqueeze(-1).expand(hidden_states.size()).float()
        sum_embeddings = torch.sum(hidden_states * mask_expanded, 1)
        sum_mask = torch.clamp(mask_expanded.sum(1), min=1e-9)
        embeddings = sum_embeddings / sum_mask
        
        print(f"\nSentence embeddings shape: {embeddings.shape}")
        print(f"Embedding dimension: {embeddings.shape[1]}")
        
        # Show first embedding (first few dimensions)
        print(f"\nFirst sentence embedding (first 10 dims):")
        print(embeddings[0][:10].numpy())
        
    else:
        print(f"\nOutput type: {type(outputs)}")
        if hasattr(outputs, 'logits'):
            print(f"Logits shape: {outputs.logits.shape}")
            
except Exception as e:
    print(f"❌ Error during inference: {e}")
    print("\nDebug info:")
    print(f"Model type: {type(ort_model) if 'ort_model' in locals() else 'Not loaded'}")
    print(f"Input types: {type(inputs)}")
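
The mean-pooling step in the cell above averages token vectors while ignoring padded positions. The same arithmetic can be sketched in plain Python on a tiny hand-made example. `mean_pool` is a hypothetical helper written for illustration, and the numbers are invented.

```python
def mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, skipping positions where the mask is 0.

    hidden_states: [batch][tokens][dims] nested lists of floats
    attention_mask: [batch][tokens] nested lists of 0/1
    """
    pooled = []
    for tokens, mask in zip(hidden_states, attention_mask):
        dims = len(tokens[0])
        sums = [0.0] * dims
        for vector, keep in zip(tokens, mask):
            for d in range(dims):
                sums[d] += vector[d] * keep
        # Same guard as torch.clamp(..., min=1e-9): avoids division by zero.
        count = max(sum(mask), 1e-9)
        pooled.append([s / count for s in sums])
    return pooled


hidden = [[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]]  # last token is padding
mask = [[1, 1, 0]]
print(mean_pool(hidden, mask))  # [[2.0, 3.0]]
```

The padded vector `[100.0, 100.0]` contributes nothing to the result, which is why the mask is multiplied in before summing.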

## Step 6: Compare with Original PyTorch Model (Optional)

In [None]:
# Load original PyTorch model for comparison
print("📊 Loading original PyTorch model for comparison...")

try:
    pytorch_model = AutoModel.from_pretrained(MODEL_NAME)
    pytorch_model.eval()
    
    # Run inference with PyTorch model
    with torch.no_grad():
        pytorch_outputs = pytorch_model(**inputs)
    
    pytorch_hidden = pytorch_outputs.last_hidden_state
    
    # Compare outputs
    if 'hidden_states' in locals():
        diff = torch.abs(hidden_states - pytorch_hidden)
        max_diff = diff.max().item()
        mean_diff = diff.mean().item()
        
        print(f"\n📈 Comparison Results:")
        print(f"Max difference: {max_diff:.6f}")
        print(f"Mean difference: {mean_diff:.6f}")
        
        if max_diff < 1e-3:
            print("✅ Excellent match! ONNX model is accurate.")
        elif max_diff < 1e-2:
            print("✅ Good match! Minor numerical differences.")
        else:
            print("⚠️ Larger differences detected. May need investigation.")
            
except Exception as e:
    print(f"Could not load PyTorch model for comparison: {e}")
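
A fixed absolute threshold can be too strict when activations are large. A complementary check, sketched below in plain Python, combines absolute and relative tolerance in the style of `math.isclose`. The helper name `outputs_match` and the default tolerances are assumptions for illustration, not values prescribed by ONNX Runtime or Optimum.

```python
import math


def outputs_match(reference, candidate, rel_tol=1e-3, abs_tol=1e-5):
    """Return (all_close, max_abs_diff) for two equal-length flat sequences."""
    if len(reference) != len(candidate):
        raise ValueError("sequences must have the same length")
    max_diff = max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)
    all_close = all(
        math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for r, c in zip(reference, candidate)
    )
    return all_close, max_diff


# Flattened tensors can be passed in, e.g. tensor.flatten().tolist().
print(outputs_match([0.5, 120.0], [0.5004, 120.05]))
```

With tensors already in memory, `torch.allclose(hidden_states, pytorch_hidden, rtol=..., atol=...)` performs a similar combined check without converting to lists.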

## Step 7: Performance Comparison

In [None]:
import time

def benchmark_model(model, inputs, num_runs=100, warmup=10):
    """Benchmark model inference speed."""
    # Warmup
    for _ in range(warmup):
        with torch.no_grad():
            _ = model(**inputs)
    
    # Benchmark
    start = time.perf_counter()
    for _ in range(num_runs):
        with torch.no_grad():
            _ = model(**inputs)
    end = time.perf_counter()
    
    avg_time = (end - start) / num_runs * 1000  # ms
    return avg_time

print("⚡ Performance Benchmark")
print("="*40)

# Benchmark ONNX model
if 'ort_model' in locals():
    onnx_time = benchmark_model(ort_model, inputs)
    print(f"ONNX Runtime: {onnx_time:.2f} ms/batch")

# Benchmark PyTorch model
if 'pytorch_model' in locals():
    pytorch_time = benchmark_model(pytorch_model, inputs)
    print(f"PyTorch: {pytorch_time:.2f} ms/batch")
    
    if 'onnx_time' in locals():
        speedup = pytorch_time / onnx_time
        print(f"\n🚀 Speedup: {speedup:.2f}x")
        if speedup > 1:
            print("✅ ONNX is faster!")
        else:
            print("📝 PyTorch is faster (may be due to small model/batch size)")
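
An average alone can hide occasional slow runs. The sketch below times each call separately and reports the mean, median, and 95th-percentile latency. `latency_stats` is a hypothetical helper, and the nearest-rank percentile is one of several reasonable definitions.

```python
import statistics
import time


def latency_stats(fn, num_runs=100, warmup=10):
    """Time fn() per call and return mean, median, and p95 latency in ms."""
    if num_runs < 1:
        raise ValueError("num_runs must be at least 1")

    for _ in range(warmup):
        fn()

    samples = []
    for _ in range(num_runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)

    samples.sort()
    # Nearest-rank 95th percentile: index = ceil(0.95 * n) - 1, in integer math.
    p95_index = (95 * len(samples) + 99) // 100 - 1
    return {
        "mean_ms": statistics.mean(samples),
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
    }


# Demonstration with a cheap stand-in workload.
print(latency_stats(lambda: sum(range(1000)), num_runs=50, warmup=5))
```

In this notebook the helper could be called as `latency_stats(lambda: ort_model(**inputs))` and again with `pytorch_model` to compare the two distributions rather than a single average.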

## Summary

This notebook demonstrated:

1. **HTP Export**: Used modelexport CLI to export BERT-tiny with hierarchy preservation
2. **Config Setup**: Copied HuggingFace config.json alongside ONNX model
3. **Optimum Loading**: Loaded ONNX model using ORTModelForFeatureExtraction
4. **Inference**: Ran inference with tokenized text inputs
5. **Verification**: Compared outputs with original PyTorch model
6. **Performance**: Benchmarked ONNX vs PyTorch speed

### Key Insights:

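- Alongside `model.onnx`, the HuggingFace `config.json` is the only file Optimum needs to load the model; the tokenizer is saved separately to prepare inputs
- No OnnxConfig or ORTConfig is required for basic inference
- ONNX Runtime can speed up inference, though for a model as small as BERT-tiny the gain may be small or absent (see the benchmark in Step 7)
- The HTP export preserves the model hierarchy in ONNX metadata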

In [None]:
# Clean up (optional)
print("\n🧹 Cleanup")
print(f"Exported model is saved at: {EXPORT_DIR}")
print("You can keep it for future use or delete it.")

# Uncomment to delete
# import shutil
# shutil.rmtree(EXPORT_DIR)
# print("✅ Cleaned up export directory")