References

- [Inference PyTorch Bert Model with ONNX Runtime on GPU](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/notebooks/PyTorch_Bert-Squad_OnnxRuntime_GPU.ipynb)
- [transformers to onnx](https://huggingface.co/docs/transformers/v4.25.1/en/serialization#export-to-onnx)

First, install the dependencies. The onnxruntime Python package comes in two variants, a CPU build and a GPU build:

- onnxruntime
- onnxruntime-gpu

In [1]:
!pip install onnxruntime-gpu

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com


Check that the onnxruntime environment is installed correctly:

In [2]:
import onnxruntime
print(onnxruntime.__version__)
print(onnxruntime.get_device())
print(onnxruntime.get_available_providers())

1.12.0
GPU
['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']
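
The provider list above can drive the choice of execution provider instead of hard-coding one. The following is a minimal sketch, not part of the original notebook: `pick_providers` is a hypothetical helper, and the preference order (CUDA first, then CPU) is an assumption about what is wanted here.

```python
def pick_providers(available, preferred=("CUDAExecutionProvider", "CPUExecutionProvider")):
    """Return the preferred providers that are actually available, in preference order."""
    chosen = [p for p in preferred if p in available]
    if not chosen:
        raise RuntimeError(f"None of {preferred} is available; found {available}")
    return chosen


# Example using the provider list printed above.
available = ["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]
print(pick_providers(available))

# With onnxruntime installed, the helper would typically be used like this:
# import onnxruntime
# providers = pick_providers(onnxruntime.get_available_providers())
# session = onnxruntime.InferenceSession("model.onnx", providers=providers)
```

Listing the CPU provider after the CUDA one gives onnxruntime a fallback on machines where only the CPU build is installed.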


As before, this walkthrough uses the BertForMaskedLM model.

In [30]:
import torch
import numpy
from transformers import BertTokenizer
enc = BertTokenizer.from_pretrained('bert-base-uncased')

masked_sentences = ['Paris is the [MASK] of France.', 
                    'The primary [MASK] of the United States is English.', 
                    'A baseball game consists of at least nine [MASK].', 
                    'Topology is a branch of [MASK] concerned with the properties of geometric objects that remain unchanged under continuous transformations.']
pos_masks = [4, 3, 9, 6]

inputs = enc(masked_sentences, return_tensors="np", padding='max_length', max_length=128)
inputs.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
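
The `pos_masks` list above is written by hand. As a hedged alternative, the mask positions could be derived from the token ids themselves. `find_mask_positions` is a hypothetical helper, and the toy ids below are made up except for 101, 102, 103 and 0, which are assumed to be the `[CLS]`, `[SEP]`, `[MASK]` and `[PAD]` ids of bert-base-uncased.

```python
def find_mask_positions(input_ids, mask_token_id):
    """Return the index of the first mask token in each row of token ids."""
    positions = []
    for row in input_ids:
        row = list(row)
        if mask_token_id not in row:
            raise ValueError("no mask token found in a sequence")
        positions.append(row.index(mask_token_id))
    return positions


# Toy rows: [CLS] w w w [MASK] w [SEP] [PAD] and [CLS] w w [MASK] w [SEP] [PAD] [PAD]
toy_ids = [
    [101, 11, 12, 13, 103, 14, 102, 0],
    [101, 21, 22, 103, 23, 102, 0, 0],
]
print(find_mask_positions(toy_ids, mask_token_id=103))

# With the tokenizer output from the cell above, this would be:
# pos_masks = find_mask_positions(inputs["input_ids"].tolist(), enc.mask_token_id)
```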

In [25]:
from transformers import BertForMaskedLM
origin_model = BertForMaskedLM.from_pretrained("bert-base-uncased")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


# Converting to an ONNX model

The `transformers.onnx` command-line tool can convert the model directly. The `--feature=masked-lm` flag is used here so that the exported head matches the BertForMaskedLM class.

In [5]:
# The local conversion still reports an error: the output says the absolute error exceeded 1e-5
!python -m transformers.onnx --model=bert-base-uncased --feature=masked-lm onnx/

Framework not requested. Using torch to export to ONNX.
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Using framework PyTorch: 1.12.0a0+8a1a93a
Overriding 1 configuration item(s)
	- use_cache -> False
Validating ONNX model...
	-[✓] ONNX model output names match reference model ({'logits'})
	- Validating ONNX Model output "logits":
		-[✓] (3, 9, 30522) matches (3, 9, 30522)
		-[x] values not close enough (atol:

## TODO: convert the model with torch.onnx.export
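
One way this TODO might be approached is sketched below. It has not been run as part of this notebook. The output path, the opset version 13, and the helper names `build_dynamic_axes` and `export_masked_lm` are assumptions chosen for illustration. The input and output names and the `batch`/`sequence` axis labels mirror what the `transformers.onnx` export reports further down.

```python
import os

INPUT_NAMES = ["input_ids", "attention_mask", "token_type_ids"]
OUTPUT_NAMES = ["logits"]


def build_dynamic_axes(input_names, output_names):
    """Mark dim 0 as 'batch' and dim 1 as 'sequence' for every named tensor."""
    return {name: {0: "batch", 1: "sequence"} for name in [*input_names, *output_names]}


def export_masked_lm(output_path="onnx/model_torch_export.onnx", opset_version=13):
    """Export bert-base-uncased (masked LM head) with torch.onnx.export."""
    # Imported lazily so the helper above can be used without torch installed.
    import torch
    from transformers import BertForMaskedLM, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    # return_dict=False makes forward() return a plain tuple, which is simpler to trace.
    model = BertForMaskedLM.from_pretrained("bert-base-uncased", return_dict=False)
    model.eval()

    sample = tokenizer("Paris is the [MASK] of France.", return_tensors="pt")
    # Positional order follows BertForMaskedLM.forward:
    # input_ids, attention_mask, token_type_ids.
    args = tuple(sample[name] for name in INPUT_NAMES)

    os.makedirs(os.path.dirname(output_path) or ".", exist_ok=True)
    torch.onnx.export(
        model,
        args,
        output_path,
        input_names=INPUT_NAMES,
        output_names=OUTPUT_NAMES,
        dynamic_axes=build_dynamic_axes(INPUT_NAMES, OUTPUT_NAMES),
        opset_version=opset_version,
        do_constant_folding=True,
    )
    return output_path


print(build_dynamic_axes(INPUT_NAMES, OUTPUT_NAMES))
# export_masked_lm()  # downloads the checkpoint and writes the ONNX file
```

Declaring the batch and sequence dimensions as dynamic lets the exported model accept inputs of other shapes than the single sample sentence used for tracing.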

# Loading the ONNX model

In [6]:
from onnxruntime import InferenceSession

# Load the ONNX model
session = InferenceSession("onnx/model.onnx", providers=["CUDAExecutionProvider"])

In [23]:
print("Inputs:")
print([x.name for x in session.get_inputs()])
print([x.shape for x in session.get_inputs()])
print([x.type for x in session.get_inputs()])

print("Outputs:")
print([x.name for x in session.get_outputs()])
print([x.shape for x in session.get_outputs()])
print([x.type for x in session.get_outputs()])


Inputs:
['input_ids', 'attention_mask', 'token_type_ids']
[['batch', 'sequence'], ['batch', 'sequence'], ['batch', 'sequence']]
['tensor(int64)', 'tensor(int64)', 'tensor(int64)']
Outputs:
['logits']
[['batch', 'sequence', 30522]]
['tensor(float)']


In [12]:
# Run inference. Note that the model inputs must be numpy arrays
outputs = session.run(output_names=["logits"], input_feed=dict(inputs))
outputs[0].shape

(4, 128, 30522)

In [24]:
most_likely_token_ids = [numpy.argmax(outputs[0][i, pos, :]) for i, pos in enumerate(pos_masks)]
print(most_likely_token_ids)
unmasked_tokens = enc.decode(most_likely_token_ids).split(' ')
unmasked_sentences = [masked_sentences[i].replace('[MASK]', token) for i, token in enumerate(unmasked_tokens)]
for sentence in unmasked_sentences:
    print(sentence)

[3007, 2653, 7202, 5597]
Paris is the capital of France.
The primary language of the United States is English.
A baseball game consists of at least nine innings.
Topology is a branch of mathematics concerned with the properties of geometric objects that remain unchanged under continuous transformations.


In [31]:
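# Compare against the original PyTorch model
inputs_pt = enc(masked_sentences, return_tensors="pt", padding='max_length', max_length=128)
with torch.no_grad():  # inference only, so no gradients are needed
    outputs = origin_model(**inputs_pt)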

most_likely_token_ids = [torch.argmax(outputs[0][i, pos, :]) for i, pos in enumerate(pos_masks)]
print(most_likely_token_ids)
unmasked_tokens = enc.decode(most_likely_token_ids).split(' ')
unmasked_sentences = [masked_sentences[i].replace('[MASK]', token) for i, token in enumerate(unmasked_tokens)]
for sentence in unmasked_sentences:
    print(sentence)

[tensor(3007), tensor(2653), tensor(7202), tensor(5597)]
Paris is the capital of France.
The primary language of the United States is English.
A baseball game consists of at least nine innings.
Topology is a branch of mathematics concerned with the properties of geometric objects that remain unchanged under continuous transformations.
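
Both models pick the same tokens, but the exporter's validation step reported an absolute error above 1e-5, so a numeric comparison of the raw logits may also be worth doing. The following is a minimal sketch. `compare_logits` is a hypothetical helper, and the variable names in the commented usage are assumptions, not names defined in this notebook.

```python
import numpy


def compare_logits(reference, candidate, atol=1e-5):
    """Return (max absolute difference, whether every value is within atol)."""
    reference = numpy.asarray(reference, dtype=numpy.float64)
    candidate = numpy.asarray(candidate, dtype=numpy.float64)
    if reference.shape != candidate.shape:
        raise ValueError(f"shape mismatch: {reference.shape} vs {candidate.shape}")
    max_abs_diff = float(numpy.max(numpy.abs(reference - candidate)))
    return max_abs_diff, max_abs_diff <= atol


# Toy example: the two arrays differ by about 1e-4 in one entry.
diff, within = compare_logits([[1.0, 2.0]], [[1.0, 2.0001]], atol=1e-3)
print(diff, within)

# With the objects from this notebook, the comparison might look like:
# onnx_logits = session.run(["logits"], dict(inputs))[0]
# torch_logits = origin_model(**inputs_pt)[0].detach().numpy()
# print(compare_logits(torch_logits, onnx_logits, atol=1e-5))
```

A small difference that leaves the argmax unchanged is consistent with the results above, where the ONNX and PyTorch predictions match token for token.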
