# Ranking Transformer ONNX model export

This notebook demonstrates how to export three different Transformer models to ONNX, which we then import into Vespa.ai for 
online serving. 

In [1]:
from transformers import AutoModel, AutoTokenizer, BertTokenizer, BertPreTrainedModel, BertModel
import transformers
import torch 
from pathlib import Path
import torch.nn as nn

## Sentence Transformer (bi-encoder) for dense retrieval

We create a wrapper model so that mean pooling over the token outputs is computed inside the ONNX graph. Almost all sentence-transformer models use mean pooling. We also normalize the output to unit length, so that nearest neighbor search can use the raw inner (dot) product instead of the angular distance, which speeds up the search.
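To make the two steps concrete, here is a framework-free sketch that mean-pools a toy sequence while skipping padded positions and then scales the result to unit length. The helper names (`mean_pool`, `l2_normalize`, `dot`, `cosine`) and the toy vectors are illustrative assumptions and are not part of the notebook. For unit-length vectors the inner product equals the cosine similarity, which is why the cheaper dot product can stand in for the angular distance.

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping positions whose mask is 0 (padding)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            for i, value in enumerate(vec):
                summed[i] += value
    count = max(sum(attention_mask), 1e-9)  # guard against an all-padding input
    return [s / count for s in summed]

def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy input: three token vectors, the last one is padding and must be ignored.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)
query = l2_normalize(pooled)
doc = l2_normalize([1.0, 0.5])

# On unit vectors, dot(query, doc) matches the cosine of the unnormalized vectors.
print(dot(query, doc), cosine(pooled, [1.0, 0.5]))
```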

In [2]:
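class MeanPoolingEncoderONNX(BertPreTrainedModel):
    """BERT encoder that mean-pools and L2-normalizes inside the exported graph."""

    def __init__(self,config):
        super().__init__(config)
        self.bert = BertModel(config)
        self.init_weights()
        
    def forward(self, input_ids, attention_mask, token_type_ids=None):
        # token_type_ids is accepted for signature compatibility but is not passed to BERT
        token_embeddings = self.bert(input_ids,attention_mask=attention_mask)[0]
        # Expand the mask over the hidden dimension so padded tokens contribute zero
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
        # Clamp the token count to avoid division by zero for an all-padding input
        sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
        sum_embeddings = sum_embeddings / sum_mask
        # Unit-length output, so the dot product equals the cosine similarity
        return torch.nn.functional.normalize(sum_embeddings, p=2, dim=1)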

In [3]:
encoder = MeanPoolingEncoderONNX.from_pretrained("sentence-transformers/msmarco-MiniLM-L-6-v3")
tokenizer = BertTokenizer.from_pretrained("sentence-transformers/msmarco-MiniLM-L-6-v3")
encoder = encoder.eval()
pipeline = transformers.Pipeline(model=encoder, tokenizer=tokenizer)
import transformers.convert_graph_to_onnx as onnx_convert
onnx_convert.convert_pytorch(pipeline, opset=11, output=Path("sentence-msmarco-MiniLM-L-6-v3.onnx"), use_external_format=False)

Using framework PyTorch: 1.7.1
Found input input_ids with shape: {0: 'batch', 1: 'sequence'}
Found input token_type_ids with shape: {0: 'batch', 1: 'sequence'}
Found input attention_mask with shape: {0: 'batch', 1: 'sequence'}
Found output output_0 with shape: {0: 'batch'}
Ensuring inputs are in correct order
token_embeddings is not present in the generated input list.
Generated inputs order: ['input_ids', 'attention_mask', 'token_type_ids']




In [4]:
onnx_convert.quantize(Path("sentence-msmarco-MiniLM-L-6-v3.onnx"))

As of onnxruntime 1.4.0, models larger than 2GB will fail to quantize due to protobuf constraint.
This limitation will be removed in the next release of onnxruntime.
         Please use quantize_static for static quantization, quantize_dynamic for dynamic quantization.
Quantized model has been written at sentence-msmarco-MiniLM-L-6-v3-quantized.onnx: ✔


PosixPath('sentence-msmarco-MiniLM-L-6-v3-quantized.onnx')
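
The quantized file stores weights at reduced precision. The notebook output does not show which scheme is applied, so the following is only a generic illustration of 8-bit affine quantization with a scale and a zero point. It is not onnxruntime's implementation, and the helper names `quantize_uint8` and `dequantize` are hypothetical.

```python
def quantize_uint8(values):
    """Map floats to integers in [0, 255] using a scale and a zero point."""
    # Include 0.0 in the range so that zero is represented exactly.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / 255.0 or 1.0  # fall back to 1.0 for an all-zero input
    zero_point = round(-lo / scale)
    q = [min(255, max(0, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float values from the 8-bit representation."""
    return [(x - zero_point) * scale for x in q]

weights = [-0.5, 0.0, 0.25, 1.0]
q, scale, zp = quantize_uint8(weights)
restored = dequantize(q, scale, zp)

# The round trip is lossy; the error per value stays within one quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, max_error)
```

The trade-off is a smaller model and cheaper integer arithmetic in exchange for a small, bounded approximation error per weight, so it is worth checking ranking quality after quantization.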

## Vespa ColBERT model (Late interaction model)

In [5]:
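class VespaColBERT(BertPreTrainedModel):
    """BERT encoder that emits one normalized 32-dimensional vector per token."""

    def __init__(self,config):
        super().__init__(config)
        self.bert = BertModel(config)
        # Project each token embedding from hidden_size down to 32 dimensions
        self.linear = nn.Linear(config.hidden_size, 32, bias=False)
        self.init_weights()

    def forward(self, input_ids, attention_mask):
        # Q has shape (batch, sequence, hidden_size)
        Q = self.bert(input_ids,attention_mask=attention_mask)[0]
        Q = self.linear(Q)
        # Normalize every token vector to unit length (dim=2 is the embedding axis)
        return torch.nn.functional.normalize(Q, p=2, dim=2)  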

In [6]:
colbert_query_encoder = VespaColBERT.from_pretrained("vespa-engine/col-minilm") 
input_names = ["input_ids", "attention_mask"]
output_names = ["contextual"]
# Dummy input for tracing: 32 query token positions (only the batch axis is declared dynamic below)
input_ids = torch.ones(1,32, dtype=torch.int64)
attention_mask = torch.ones(1,32,dtype=torch.int64)
args = (input_ids, attention_mask)
torch.onnx.export(colbert_query_encoder,
                args=args,
                f="vespa-colMiniLM-L-6.onnx",
                input_names = input_names,
                output_names = output_names,
                dynamic_axes = {
                    "input_ids": {0: "batch"},
                    "attention_mask": {0: "batch"},
                    "contextual": {0: "batch"},
                },
                opset_version=11)

In [7]:
onnx_convert.quantize(Path("vespa-colMiniLM-L-6.onnx"))

As of onnxruntime 1.4.0, models larger than 2GB will fail to quantize due to protobuf constraint.
This limitation will be removed in the next release of onnxruntime.
         Please use quantize_static for static quantization, quantize_dynamic for dynamic quantization.
Quantized model has been written at vespa-colMiniLM-L-6-quantized.onnx: ✔


PosixPath('vespa-colMiniLM-L-6-quantized.onnx')
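
The exported encoder only produces per-token vectors; the late interaction itself happens at ranking time. As a rough sketch of the ColBERT MaxSim operator, each query token vector is matched with its best-scoring document token vector and the maxima are summed. The function name `max_sim` and the 2-dimensional toy vectors (instead of the 32 dimensions used above) are assumptions for illustration, not code from the notebook.

```python
def max_sim(query_vectors, doc_vectors):
    """Sum, over query tokens, of the best dot product against any document token."""
    score = 0.0
    for q in query_vectors:
        score += max(sum(a * b for a, b in zip(q, d)) for d in doc_vectors)
    return score

# Toy unit-length token vectors.
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.6, 0.8], [1.0, 0.0], [0.0, -1.0]]

# First query token matches doc token 2 (1.0), second matches doc token 1 (0.8).
score = max_sim(query, doc)
print(score)
```

Because the token vectors are unit length, every per-token maximum is a cosine similarity, so the total score is bounded by the number of query tokens.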

## Cross Attention Model 

In [8]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
cross_model = "cross-encoder/ms-marco-MiniLM-L-6-v2"
output_file = "ms-marco-MiniLM-L-6-v2.onnx"
tokenizer = AutoTokenizer.from_pretrained(cross_model)
model = AutoModelForSequenceClassification.from_pretrained(cross_model)
model = model.eval()
pipeline = transformers.Pipeline(model=model, tokenizer=tokenizer)
onnx_convert.convert_pytorch(pipeline, opset=11, output=Path(output_file), use_external_format=False)
onnx_convert.quantize(Path(output_file))

Using framework PyTorch: 1.7.1
Found input input_ids with shape: {0: 'batch', 1: 'sequence'}
Found input token_type_ids with shape: {0: 'batch', 1: 'sequence'}
Found input attention_mask with shape: {0: 'batch', 1: 'sequence'}
Found output output_0 with shape: {0: 'batch'}
Ensuring inputs are in correct order
position_ids is not present in the generated input list.
Generated inputs order: ['input_ids', 'attention_mask', 'token_type_ids']
As of onnxruntime 1.4.0, models larger than 2GB will fail to quantize due to protobuf constraint.
This limitation will be removed in the next release of onnxruntime.
         Please use quantize_static for static quantization, quantize_dynamic for dynamic quantization.
Quantized model has been written at ms-marco-MiniLM-L-6-v2-quantized.onnx: ✔


PosixPath('ms-marco-MiniLM-L-6-v2-quantized.onnx')
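
The export log lists three inputs: `input_ids`, `attention_mask` and `token_type_ids`. A cross-encoder reads the query and the document as one concatenated sequence, and the sketch below shows one way to lay out those three inputs from already-tokenized ids. The helper `build_cross_encoder_inputs`, the special token ids (101 and 102, assumed from the BERT uncased vocabulary), the padding id and the toy token ids are all assumptions; in practice the tokenizer loaded above produces these arrays.

```python
CLS_ID = 101  # assumed [CLS] id (BERT uncased vocabulary)
SEP_ID = 102  # assumed [SEP] id (BERT uncased vocabulary)

def build_cross_encoder_inputs(query_ids, doc_ids, max_length=16, pad_id=0):
    """Lay out [CLS] query [SEP] document [SEP] and pad to max_length."""
    input_ids = [CLS_ID] + query_ids + [SEP_ID] + doc_ids + [SEP_ID]
    # Segment 0 covers [CLS], the query and the first [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(query_ids) + 2) + [1] * (len(doc_ids) + 1)
    # Naive truncation: a real tokenizer would keep the trailing [SEP].
    input_ids = input_ids[:max_length]
    token_type_ids = token_type_ids[:max_length]
    padding = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * padding,
        "attention_mask": [1] * len(input_ids) + [0] * padding,
        "token_type_ids": token_type_ids + [0] * padding,
    }

# Hypothetical token ids for a two-token query and a three-token document.
inputs = build_cross_encoder_inputs([2054, 2003], [3000, 2003, 1996], max_length=10)
print(inputs)
```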