<a href="https://colab.research.google.com/github/rahiakela/small-language-models-fine-tuning/blob/main/domain-specific-small-language-models/03-exploring-onnx/01_bert_onnx_conversion.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### ONNX Conversion of the BERT Base Uncased Model


  
The code in this notebook is to introduce readers to the [ONNX](https://onnx.ai/) format and [ONNX Runtime](https://onnxruntime.ai/) with the [BERT Base Uncased](https://huggingface.co/google-bert/bert-base-uncased) model. It can be executed in the Colab free tier with hardware acceleration (GPU).  

### Settings

Install the missing requirements in the Colab VM (ONNX, the ONNX runtime and the HF's Datasets).

In [None]:
!pip install -q onnx onnxruntime datasets

Download the BERT Base Uncased model (and associated tokenizer) from the Hugging Face Hub.

In [2]:
from transformers import AutoModelForQuestionAnswering, BertTokenizer

model_id = 'google-bert/bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_id)
model = AutoModelForQuestionAnswering.from_pretrained(model_id)
model.eval()

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForQuestionAnswering(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, 

Download a subset of the SQuAD dataset from the Hugging Face Hub.

In [7]:
from datasets import load_dataset

samples_count = 200
squad = load_dataset("squad", split="validation[:"+ str(samples_count) +"]")

Display one test sample.

In [4]:
squad[0]

{'id': '56be4db0acb8001400a502ec',
 'title': 'Super_Bowl_50',
 'context': 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.',
 'question': 'Which NFL team represented the AFC at Super Bowl 50?',
 'answers': {'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'],


Benchmark the original model on the selected subset of the squad test set.

In [5]:
import time
import torch

max_seq_length = 128
# Measure the latency.
latency = []
with torch.no_grad():
    for i in range(samples_count):
        inputs = tokenizer(squad["question"][i], squad["context"][i], return_tensors="pt")
        start = time.time()
        outputs = model(**inputs)
        latency.append(time.time() - start)
print("PyTorch {} Average inference time = {} ms".format('CPU', format(sum(latency) * 1000 / len(latency), '.2f')))

PyTorch CPU Average inference time = 710.70 ms


### Convert the model to ONNX.

Create the directory to host the converted model.

In [8]:
import os

output_dir = os.path.join(".", "onnx_models")
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
export_model_path = os.path.join(output_dir, 'bert-base-uncased.onnx')

Pick up one sample from the test dataset.

In [9]:
tokenized_inputs = tokenizer(squad["question"][0], squad["context"][0], return_tensors="pt")
inputs = {
        'input_ids':  tokenized_inputs['input_ids'],
        'input_mask': tokenized_inputs['attention_mask'],
        'segment_ids': tokenized_inputs['token_type_ids']
    }

Export the model to ONNX.

In [10]:
with torch.no_grad():
    symbolic_names = {0: 'batch_size', 1: 'max_seq_len'}
    torch.onnx.export(model,
                      args=tuple(inputs.values()),
                      f=export_model_path,
                      opset_version=15,
                      do_constant_folding=True,
                      input_names=['input_ids',
                                   'input_mask',
                                   'segment_ids'],
                      output_names=['start', 'end'],
                      dynamic_axes={'input_ids': symbolic_names,
                                    'input_mask' : symbolic_names,
                                    'segment_ids' : symbolic_names,
                                    'start' : symbolic_names,
                                    'end' : symbolic_names})
    print("Model exported at ", export_model_path)

Model exported at  ./onnx_models/bert-base-uncased.onnx


Validate the exported model.

In [11]:
from onnx.checker import check_model

check_model(export_model_path, full_check=True)

Benchmark the exported model (CPUExecutionProvider).

In [12]:
import onnxruntime
import numpy

sess_options = onnxruntime.SessionOptions()

sess_options.optimized_model_filepath = os.path.join(output_dir, "bert-base-uncased.onnx")

session = onnxruntime.InferenceSession(export_model_path, sess_options, providers=['CPUExecutionProvider'])

In [13]:
latency = []
for i in range(samples_count):
    full_inputs = tokenizer(squad["question"][i], squad["context"][i], return_tensors="np")
    ort_inputs = {
        'input_ids':  full_inputs['input_ids'],
        'input_mask': full_inputs['attention_mask'],
        'segment_ids': full_inputs['token_type_ids']
    }
    start = time.time()
    ort_outputs = session.run(None, ort_inputs)
    latency.append(time.time() - start)
print("OnnxRuntime cpu Average inference time = {} ms".format(format(sum(latency) * 1000 / len(latency), '.2f')))

OnnxRuntime cpu Average inference time = 382.42 ms


Verify correctess of the exported model.

In [14]:
print("***** Verifying correctness *****")
sample_range = 2
for i in range(sample_range):
    print('PyTorch and ONNX Runtime output {} are close:'.format(i), numpy.allclose(ort_outputs[i], outputs[i].cpu(), rtol=1e-05, atol=1e-04))

***** Verifying correctness *****
PyTorch and ONNX Runtime output 0 are close: True
PyTorch and ONNX Runtime output 1 are close: True


### Model Optimization

Optimize the exported model.

In [15]:
from onnxruntime.transformers import optimizer

optimized_model_path = os.path.join(output_dir, 'bert-base-uncased.onnx_opt_cpu.onnx')
optimized_model = optimizer.optimize_model(export_model_path, model_type='bert', num_heads=12, hidden_size=768)
optimized_model.save_model_to_file(optimized_model_path)

Benchmark the optimized model (CPUExecutionProvider).

In [16]:
sess_options_opt = onnxruntime.SessionOptions()

sess_options_opt.optimized_model_filepath = os.path.join(output_dir, "bert-base-uncased.onnx_opt_cpu.onnx")

session_opt = onnxruntime.InferenceSession(export_model_path, sess_options_opt, providers=['CPUExecutionProvider'])

In [17]:
latency_opt = []
for i in range(samples_count):
    full_inputs = tokenizer(squad["question"][i], squad["context"][i], return_tensors="np")
    ort_inputs = {
        'input_ids':  full_inputs['input_ids'],
        'input_mask': full_inputs['attention_mask'],
        'segment_ids': full_inputs['token_type_ids']
    }
    start = time.time()
    ort_outputs = session_opt.run(None, ort_inputs)
    latency_opt.append(time.time() - start)
print("OnnxRuntime cpu Average inference time = {} ms".format(format(sum(latency_opt) * 1000 / len(latency_opt), '.2f')))

OnnxRuntime cpu Average inference time = 352.36 ms
