<a href="https://colab.research.google.com/github/rahiakela/small-language-models-fine-tuning/blob/main/domain-specific-small-language-models/03-exploring-onnx/02_gpt_2_sm_onnx_conversion.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## ONNX Conversion of the GPT-2 Small Model


This notebook introduces the [ONNX](https://onnx.ai/) format and [ONNX Runtime](https://onnxruntime.ai/) on GPU, using the [GPT-2 Small](https://huggingface.co/openai-community/gpt2) model as the example. It requires hardware acceleration (GPU).

**Update, September 2025: the code in this notebook is no longer compatible with PyTorch 2.1 or later, nor with the Hugging Face Transformers releases that support the latest PyTorch. PyTorch and Transformers therefore need to be downgraded first.**

In [None]:
!pip install torch==2.0.1 transformers==4.31.0
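
Before running the remaining cells, it can help to confirm that the downgraded versions are the ones actually in use. The following is a minimal sketch that relies only on the compatibility statement in the update note above (PyTorch releases before 2.1); `parse_version` and `is_supported_torch` are hypothetical helper names introduced for illustration and are not part of PyTorch.

```python
import re


def parse_version(version: str) -> tuple:
    """Return the leading (major, minor) numbers of a version string such as '2.0.1+cu118'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_supported_torch(version: str) -> bool:
    """True for PyTorch releases older than 2.1 (assumption taken from the update note)."""
    return parse_version(version) < (2, 1)


print(is_supported_torch("2.0.1"))        # the version pinned in the install cell
print(is_supported_torch("2.1.0+cu121"))  # a newer release
```

In the notebook itself, the check could be applied to the installed package by passing `torch.__version__` to `is_supported_torch` after `import torch`.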

Install the remaining requirements: ONNX and the GPU build of ONNX Runtime.

In [None]:
!pip install onnx onnxruntime-gpu

Download the GPT-2 Small model from the Hugging Face Hub and load it into GPU memory.

In [3]:
import torch
from transformers import GPT2Tokenizer, AutoModelForCausalLM

model_id = 'openai-community/gpt2'
tokenizer = GPT2Tokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
device = torch.device("cuda")
model.eval().to(device)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

Verify that the downloaded model works as expected.

In [4]:
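inputs = tokenizer("The story so far: in the beginning, the universe was created.", return_attention_mask=False, return_tensors="pt")
# Move the tokenized inputs to the GPU so that they sit on the same device as the model.
inputs = inputs.to(device)
print("input tensors")
print(inputs)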
print("input tensor shape")
print(inputs["input_ids"].size())

with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits
print("output tensor")
print(logits)
print("output shape")
print(logits.shape)

input tensors
{'input_ids': tensor([[ 464, 1621,  523, 1290,   25,  287,  262, 3726,   11,  262, 6881,  373,
         2727,   13]], device='cuda:0')}
input tensor shape
torch.Size([1, 14])
output tensor
tensor([[[ -36.2874,  -35.0114,  -38.0793,  ...,  -40.5163,  -41.3760,
           -34.9193],
         [ -96.0524,  -95.8698,  -99.3108,  ..., -103.6897, -103.3026,
           -96.3700],
         [ -72.0065,  -72.1456,  -76.6058,  ...,  -76.3842,  -73.5555,
           -72.9844],
         ...,
         [-115.4907, -115.5128, -119.5238,  ..., -124.6191, -118.0925,
          -117.5597],
         [-115.7825, -118.0826, -122.0568,  ..., -128.6989, -126.3214,
          -119.5271],
         [-152.4017, -152.6696, -153.9131,  ..., -164.9893, -163.4139,
          -146.6134]]], device='cuda:0')
output shape
torch.Size([1, 14, 50257])
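
The logits above have shape `[batch, sequence, vocab]`, so the row at the last sequence position holds the scores for the token that would follow the prompt. As a rough illustration of how such an output can be read, the sketch below picks the highest-scoring token id at the last position. It uses a toy nested list in place of the real tensor, and `greedy_next_token` is a hypothetical helper name, not a Transformers function.

```python
def greedy_next_token(logits):
    """Pick the highest-scoring token id at the last position of each sequence.

    `logits` is a nested list shaped [batch, sequence, vocab], mirroring the
    [1, 14, 50257] tensor printed above (toy sizes are used here).
    """
    next_ids = []
    for sequence in logits:
        last_position = sequence[-1]
        best_id = max(range(len(last_position)), key=last_position.__getitem__)
        next_ids.append(best_id)
    return next_ids


# Toy logits: 1 sequence, 3 positions, vocabulary of 4 tokens.
toy_logits = [[[-3.0, -1.5, -2.0, -4.0],
               [-2.5, -0.5, -3.0, -1.0],
               [-1.0, -2.0, -0.2, -3.5]]]
print(greedy_next_token(toy_logits))  # [2]
```

With the real tensor, the same selection can be written as `logits[:, -1, :].argmax(dim=-1)`, and the resulting id can be turned back into text with `tokenizer.decode`.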


## ONNX Conversion

Create a directory in which to store the converted ONNX model.

In [5]:
import os

output_dir = os.path.join(".", "onnx_models")
os.makedirs(output_dir, exist_ok=True)  # no error if the directory already exists
export_model_path = os.path.join(output_dir, 'gpt-2.onnx')

Create a sample input tensor to be used for tracing the model during conversion.

In [6]:
tokenized_inputs = tokenizer("The story so far: in the beginning, the universe was created.",
                             return_attention_mask=False,
                             return_tensors="pt")
# The sample input must be on the same device as the model.
tokenized_inputs = tokenized_inputs.to(device)
inputs_sample = {
    'input_ids':  tokenized_inputs['input_ids']
}

Convert the model to ONNX.

In [12]:
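with torch.no_grad():
    torch.onnx.export(model,
                      inputs_sample,             # sample input used to trace the model
                      export_model_path,         # destination of the ONNX file
                      export_params=True,        # store the trained weights in the file
                      opset_version=15,
                      do_constant_folding=True,  # pre-compute constant expressions
                      input_names=['input_ids']
                      )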


RuntimeError: Only tuples, lists and Variables are supported as JIT inputs/outputs. Dictionaries and strings are also accepted, but their usage is not recommended. Here, received an input of unsupported type: DynamicCache

Optimize the exported model.

In [None]:
from onnxruntime.transformers import optimizer

optimized_model_path = os.path.join(output_dir, 'gpt-2-onnx_opt_gpu.onnx')
optimized_model = optimizer.optimize_model(export_model_path,
                                           model_type='gpt2',
                                           use_gpu=True,
                                           num_heads=12,
                                           hidden_size=768,
                                           verbose=True)
optimized_model.save_model_to_file(optimized_model_path)

Benchmark inference with the original model.

In [None]:
import time

with torch.inference_mode():
    # Generate one short sample to confirm that generation works.
    sample_output = model.generate(inputs.input_ids, max_length=64, pad_token_id=50256)
    print(tokenizer.decode(sample_output[0], skip_special_tokens=False))
    # Warm-up runs, excluded from the timing.
    for _ in range(2):
        _ = model.generate(inputs.input_ids, max_length=64, pad_token_id=50256)
        torch.cuda.synchronize()
    # Timed runs: average over 10 generations with max_length=256.
    start = time.time()
    for _ in range(10):
        _ = model.generate(inputs.input_ids, max_length=256, pad_token_id=50256)
        torch.cuda.synchronize()
    print(f"----\nPyTorch: {(time.time() - start)/10:.2f}s/sequence")
# Move the model back to the CPU.
_ = model.cpu()
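
The benchmark cells in this notebook all follow the same pattern: a few warm-up calls, then a timed loop whose total is divided by the number of runs. As a sketch of how that pattern could be factored out, the helper below takes any zero-argument callable. The name `benchmark` and its parameters are assumptions made for illustration, and a trivial CPU workload stands in for the model so that the sketch runs without a GPU.

```python
import time


def benchmark(fn, warmup=2, runs=10, sync=None):
    """Return the mean wall-clock seconds per call of `fn`.

    `sync` is an optional zero-argument callable (for example
    torch.cuda.synchronize) invoked after every call so that pending GPU
    work is included in the measurement.
    """
    # Warm-up calls are executed but not timed.
    for _ in range(warmup):
        fn()
        if sync is not None:
            sync()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
        if sync is not None:
            sync()
    return (time.perf_counter() - start) / runs


# Stand-in workload so that the sketch runs without a GPU.
mean_seconds = benchmark(lambda: sum(range(100_000)))
print(f"{mean_seconds:.6f}s/call")
```

In this notebook, `fn` could wrap a `model.generate(...)` call with `sync=torch.cuda.synchronize`, or an ONNX Runtime `session.run(None, ort_inputs)` call, so that every measurement uses the same warm-up and run counts.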

Benchmark inference with the ONNX converted model.

In [None]:
import onnxruntime
import numpy

session = onnxruntime.InferenceSession(export_model_path, providers=["CUDAExecutionProvider"])
onnx_input_ids = tokenizer("The story so far: in the beginning, the universe was created.",
                           return_attention_mask=False,
                           return_tensors="np")
ort_inputs = {
    "input_ids": onnx_input_ids['input_ids']
}

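# Warm-up runs, excluded from the timing.
for _ in range(2):
    ort_outputs = session.run(None, ort_inputs)
# Each timed run is a single forward pass over the prompt.
start = time.time()
for _ in range(10):
    ort_outputs = session.run(None, ort_inputs)
print(f"----\nONNX Runtime: {(time.time() - start)/10:.2f}s/sequence")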

Benchmark inference with the optimized ONNX model.

In [None]:
import onnxruntime
import numpy

opt_session = onnxruntime.InferenceSession(optimized_model_path, providers=["CUDAExecutionProvider"])
onnx_input_ids = tokenizer("The story so far: in the beginning, the universe was created.",
                           return_attention_mask=False,
                           return_tensors="np")
ort_inputs = {
    "input_ids": onnx_input_ids['input_ids']
}

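# Warm-up runs, excluded from the timing.
for _ in range(2):
    ort_outputs = opt_session.run(None, ort_inputs)
# Each timed run is a single forward pass over the prompt.
start = time.time()
for _ in range(10):
    ort_outputs = opt_session.run(None, ort_inputs)
print(f"----\nOptimized ONNX Runtime: {(time.time() - start)/10:.2f}s/sequence")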