# Reorganizing the code myself and adding some comments

In [1]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com


In [2]:
from transformers import BertTokenizer, BertForMaskedLM
import torch
import timeit
import numpy as np
import torch_tensorrt
import torch.backends.cudnn as cudnn



In [3]:
# Load the tokenizer
enc = BertTokenizer.from_pretrained('bert-base-uncased')

Define the input shapes. The BERT model takes three inputs:

- input_ids: (batch_size, sequence_length)
- attention_mask: (batch_size, sequence_length)
- token_type_ids: (batch_size, sequence_length)

These names match the ones used in the transformers documentation:
https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertForMaskedLM.forward

In [4]:
batch_size = 4

# Define the input values
batched_indexed_tokens = [[101, 64]*64]*batch_size
batched_segment_ids = [[0, 1]*64]*batch_size
batched_attention_masks = [[1, 1]*64]*batch_size

tokens_tensor = torch.tensor(batched_indexed_tokens)
segments_tensor = torch.tensor(batched_segment_ids)
attention_masks_tensor = torch.tensor(batched_attention_masks)
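The list-multiplication expressions in the cell above are compact but easy to misread. As a quick check that does not need PyTorch, the sketch below computes the shape of each nested list and confirms it is (batch_size, 128), the sequence length used in the rest of this notebook. `nested_shape` is a hypothetical helper written for this illustration, not part of any library.

```python
def nested_shape(nested):
    """Return the shape of a rectangular, non-ragged nested list as a tuple."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        if not nested:  # stop at an empty list instead of indexing into it
            break
        nested = nested[0]
    return tuple(shape)


batch_size = 4

# Same expressions as in the notebook cell: a 2-element pattern repeated 64 times
batched_indexed_tokens = [[101, 64] * 64] * batch_size
batched_segment_ids = [[0, 1] * 64] * batch_size
batched_attention_masks = [[1, 1] * 64] * batch_size

shapes = [
    nested_shape(x)
    for x in (batched_indexed_tokens, batched_segment_ids, batched_attention_masks)
]
print(shapes)
```

These tensors are only example inputs for tracing, so their values matter less than their shape and dtype.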

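Load the model and switch it to eval mode once it is loaded. Note that torchscript must be set to True.

The second step is to convert the model to a TorchScript model with torch.jit.trace. Torch-TensorRT supports two frontends: one is TorchScript, the other is FX.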

In [5]:
mlm_model_ts = BertForMaskedLM.from_pretrained('bert-base-uncased', torchscript=True).eval()
# Example inputs follow forward()'s positional order: input_ids, attention_mask, token_type_ids
traced_mlm_model = torch.jit.trace(mlm_model_ts, [tokens_tensor, attention_masks_tensor, segments_tensor])

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [6]:
# Prepare the inputs: the model should predict what belongs in place of each [MASK]
masked_sentences = ['Paris is the [MASK] of France.', 
                    'The primary [MASK] of the United States is English.', 
                    'A baseball game consists of at least nine [MASK].', 
                    'Topology is a branch of [MASK] concerned with the properties of geometric objects that remain unchanged under continuous transformations.']
pos_masks = [4, 3, 9, 6]
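The values in pos_masks are the index of the [MASK] token in each tokenized sentence, counting the leading [CLS] token as index 0. Instead of hard-coding them, they could be found by searching for the [MASK] token id, which is 103 in the bert-base-uncased vocabulary (101 and 102 are [CLS] and [SEP]). The sketch below is only an illustration: the ids used for ordinary words are made-up placeholders, and `find_mask_position` is a hypothetical helper.

```python
CLS_ID, SEP_ID, MASK_ID = 101, 102, 103  # special-token ids in bert-base-uncased


def find_mask_position(token_ids, mask_id=MASK_ID):
    """Return the index of the first [MASK] token in a list of token ids."""
    return token_ids.index(mask_id)


# Stand-in for "[CLS] paris is the [MASK] of france . [SEP]".
# The 9000-range ids are placeholders, not real vocabulary ids.
toy_ids = [CLS_ID, 9001, 9002, 9003, MASK_ID, 9004, 9005, 9006, SEP_ID]

position = find_mask_position(toy_ids)
print(position)
```

With the real tokenizer, the same search could be run on each row of input_ids, using enc.mask_token_id instead of the literal 103.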

First, check the results on the original model, mlm_model_ts.

In [7]:
encoded_inputs = enc(masked_sentences, return_tensors='pt', padding='max_length', max_length=128)
outputs = mlm_model_ts(**encoded_inputs)
most_likely_token_ids = [torch.argmax(outputs[0][i, pos, :]) for i, pos in enumerate(pos_masks)]
unmasked_tokens = enc.decode(most_likely_token_ids).split(' ')
unmasked_sentences = [masked_sentences[i].replace('[MASK]', token) for i, token in enumerate(unmasked_tokens)]
for sentence in unmasked_sentences:
    print(sentence)

Paris is the capital of France.
The primary language of the United States is English.
A baseball game consists of at least nine innings.
Topology is a branch of mathematics concerned with the properties of geometric objects that remain unchanged under continuous transformations.
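The post-processing in the cell above decodes all predicted ids into one string and splits it on spaces, so it implicitly assumes that every prediction decodes to a single space-free word. The sketch below isolates that step in plain Python, using two of the predictions printed above as a stand-in for the decoded string, and adds a length check that would catch a mismatch.

```python
masked_sentences = [
    'Paris is the [MASK] of France.',
    'The primary [MASK] of the United States is English.',
]

# Stand-in for enc.decode(most_likely_token_ids): one word per sentence
decoded = 'capital language'
unmasked_tokens = decoded.split(' ')

# If a prediction decoded to zero or several words, the counts would differ
if len(unmasked_tokens) != len(masked_sentences):
    raise ValueError('expected exactly one decoded word per sentence')

unmasked_sentences = [
    sentence.replace('[MASK]', token)
    for sentence, token in zip(masked_sentences, unmasked_tokens)
]
for sentence in unmasked_sentences:
    print(sentence)
```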


Then check the inference results on the converted model, traced_mlm_model.

In [8]:
encoded_inputs = enc(masked_sentences, return_tensors='pt', padding='max_length', max_length=128)
outputs = traced_mlm_model(**encoded_inputs)
most_likely_token_ids = [torch.argmax(outputs[0][i, pos, :]) for i, pos in enumerate(pos_masks)]
unmasked_tokens = enc.decode(most_likely_token_ids).split(' ')
unmasked_sentences = [masked_sentences[i].replace('[MASK]', token) for i, token in enumerate(unmasked_tokens)]
for sentence in unmasked_sentences:
    print(sentence)

Paris is the capital of France.
The primary language of the United States is English.
A baseball game consists of at least nine innings.
Topology is a branch of mathematics concerned with the properties of geometric objects that remain unchanged under continuous transformations.


As you can see, the results are identical.

In [9]:
new_level = torch_tensorrt.logging.Level.Error
torch_tensorrt.logging.set_reportable_log_level(new_level)

For a model to use TensorRT, it has to be compiled ahead of time, which is done with torch_tensorrt.compile.

In [10]:
trt_model = torch_tensorrt.compile(
    # The model to compile
    traced_mlm_model, 
    # The model inputs, in the positional order of forward()
    inputs= [torch_tensorrt.Input(shape=[batch_size, 128], dtype=torch.int32),  # input_ids
             torch_tensorrt.Input(shape=[batch_size, 128], dtype=torch.int32),  # attention_mask
             torch_tensorrt.Input(shape=[batch_size, 128], dtype=torch.int32)], # token_type_ids
    # The precisions TensorRT is allowed to use
    enabled_precisions= {torch.float32}, # Run with 32-bit precision
    # Not sure about this one; the docs describe it as "Maximum size of workspace given to TensorRT"
    workspace_size=2000000000,
    # Truncate int64 and float64 weights to int32 and float32
    truncate_long_and_double=True
)

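Check the inference results of the compiled model. Note that keyword arguments can no longer be used at this point; the arguments have to be passed positionally, in order.
Also, a TensorRT model can only run inference on a CUDA GPU, so the inputs have to be moved to the cuda device.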
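Because the compiled module only takes positional arguments, the order matters: BertForMaskedLM.forward declares input_ids, attention_mask, token_type_ids, in that order. One way to avoid mixing them up is to build the positional list from the tokenizer's dict by name. The sketch below uses plain strings as stand-ins for tensors; `FORWARD_ORDER` and `to_positional` are hypothetical names, and the key order of the stand-in dict is an assumption about what the tokenizer returns.

```python
# Positional order of the first three parameters of BertForMaskedLM.forward
FORWARD_ORDER = ("input_ids", "attention_mask", "token_type_ids")


def to_positional(enc_inputs, order=FORWARD_ORDER):
    """Return the dict values as a list, arranged in the given parameter order."""
    missing = [name for name in order if name not in enc_inputs]
    if missing:
        raise KeyError(f"missing inputs: {missing}")
    return [enc_inputs[name] for name in order]


# Stand-in for the tokenizer output; strings take the place of tensors.
# The dict order deliberately differs from the forward() order.
enc_inputs = {
    "input_ids": "ids",
    "token_type_ids": "segments",
    "attention_mask": "mask",
}

args = to_positional(enc_inputs)
print(args)
```

With real tensors, the call would then look like trt_model(*to_positional(enc_inputs)).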

In [14]:
enc_inputs = enc(masked_sentences, return_tensors='pt', padding='max_length', max_length=128)
# Move the inputs to the GPU
enc_inputs = {k: v.type(torch.int32).cuda() for k, v in enc_inputs.items()}
# Pass the arguments positionally, in order
output_trt = trt_model(enc_inputs['input_ids'], enc_inputs['attention_mask'], enc_inputs['token_type_ids'])
most_likely_token_ids_trt = [torch.argmax(output_trt[i, pos, :]) for i, pos in enumerate(pos_masks)] 
unmasked_tokens_trt = enc.decode(most_likely_token_ids_trt).split(' ')
unmasked_sentences_trt = [masked_sentences[i].replace('[MASK]', token) for i, token in enumerate(unmasked_tokens_trt)]
for sentence in unmasked_sentences_trt:
    print(sentence)

Paris is the capital of France.
The primary language of the United States is English.
A baseball game consists of at least nine innings.
Topology is a branch of mathematics concerned with the properties of geometric objects that remain unchanged under continuous transformations.


Load a half-precision (FP16) model, then check its inference results.

In [15]:
mlm_model_ts_half = BertForMaskedLM.from_pretrained('bert-base-uncased', torchscript=True).half().eval().cuda()

encoded_inputs = enc(masked_sentences, return_tensors='pt', padding='max_length', max_length=128)
encoded_inputs = {k: v.type(torch.int32).cuda() for k, v in encoded_inputs.items()}
outputs = mlm_model_ts_half(**encoded_inputs)
most_likely_token_ids = [torch.argmax(outputs[0][i, pos, :]) for i, pos in enumerate(pos_masks)]
unmasked_tokens = enc.decode(most_likely_token_ids).split(' ')
unmasked_sentences = [masked_sentences[i].replace('[MASK]', token) for i, token in enumerate(unmasked_tokens)]
for sentence in unmasked_sentences:
    print(sentence)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Paris is the capital of France.
The primary language of the United States is English.
A baseball game consists of at least nine innings.
Topology is a branch of mathematics concerned with the properties of geometric objects that remain unchanged under continuous transformations.


Convert the model to TorchScript, then compile it into a TensorRT model.

In [16]:
# Example inputs follow forward()'s positional order: input_ids, attention_mask, token_type_ids
traced_mlm_model_half = torch.jit.trace(mlm_model_ts_half, [tokens_tensor.cuda(), attention_masks_tensor.cuda(), segments_tensor.cuda()])

trt_model_fp16 = torch_tensorrt.compile(
    traced_mlm_model_half, 
    inputs= [torch_tensorrt.Input(shape=[batch_size, 128], dtype=torch.int32),  # input_ids
             torch_tensorrt.Input(shape=[batch_size, 128], dtype=torch.int32),  # attention_mask
             torch_tensorrt.Input(shape=[batch_size, 128], dtype=torch.int32)], # token_type_ids
    # Use half precision here, i.e. 16-bit floats
    enabled_precisions= {torch.half}, # Run with 16-bit precision
    workspace_size=2000000000,
    truncate_long_and_double=True
)

Check the inference results in the same way.

In [17]:
enc_inputs = enc(masked_sentences, return_tensors='pt', padding='max_length', max_length=128)
enc_inputs = {k: v.type(torch.int32).cuda() for k, v in enc_inputs.items()}
output_trt = trt_model_fp16(enc_inputs['input_ids'], enc_inputs['attention_mask'], enc_inputs['token_type_ids'])
most_likely_token_ids_trt = [torch.argmax(output_trt[i, pos, :]) for i, pos in enumerate(pos_masks)] 
unmasked_tokens_trt = enc.decode(most_likely_token_ids_trt).split(' ')
unmasked_sentences_trt = [masked_sentences[i].replace('[MASK]', token) for i, token in enumerate(unmasked_tokens_trt)]
for sentence in unmasked_sentences_trt:
    print(sentence)

Paris is the capital of France.
The primary language of the United States is English.
A baseball game consists of at least nine innings.
Topology is a branch of science concerned with the properties of geometric objects that remain unchanged under continuous transformations.


# Measuring inference speed

In [19]:
def timeGraph(model, input_tensor1, input_tensor2, input_tensor3, num_loops=50):
    print("Warm up ...")
    with torch.no_grad():
        for _ in range(20):
            features = model(input_tensor1, input_tensor2, input_tensor3)

    torch.cuda.synchronize()

    print("Start timing ...")
    timings = []
    with torch.no_grad():
        for i in range(num_loops):
            start_time = timeit.default_timer()
            features = model(input_tensor1, input_tensor2, input_tensor3)
            torch.cuda.synchronize()
            end_time = timeit.default_timer()
            timings.append(end_time - start_time)
            # print("Iteration {}: {:.6f} s".format(i, end_time - start_time))

    return timings

In [20]:
def printStats(graphName, timings, batch_size):
    times = np.array(timings)
    steps = len(times)
    speeds = batch_size / times
    time_mean = np.mean(times)
    time_med = np.median(times)
    time_99th = np.percentile(times, 99)
    time_std = np.std(times, ddof=0)
    speed_mean = np.mean(speeds)
    speed_med = np.median(speeds)

    msg = ("\n%s =================================\n"
            "batch size=%d, num iterations=%d\n"
            "  Median text batches/second: %.1f, mean: %.1f\n"
            "  Median latency: %.6f, mean: %.6f, 99th_p: %.6f, std_dev: %.6f\n"
            ) % (graphName,
                batch_size, steps,
                speed_med, speed_mean,
                time_med, time_mean, time_99th, time_std)
    print(msg)

In [26]:
cudnn.benchmark = True

# Prepare the inputs; the same inputs are used for every model
enc_inputs = enc(masked_sentences, return_tensors='pt', padding='max_length', max_length=128)
enc_inputs = {k: v.type(torch.int32).cuda() for k, v in enc_inputs.items()}

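The original model, mlm_model_ts:

In [27]:
# Arguments follow forward()'s positional order: input_ids, attention_mask, token_type_ids
timings = timeGraph(mlm_model_ts.cuda(), enc_inputs['input_ids'], enc_inputs['attention_mask'], enc_inputs['token_type_ids'])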

printStats("BERT", timings, batch_size)

Warm up ...
Start timing ...

batch size=4, num iterations=50
  Median text batches/second: 500.8, mean: 503.9
  Median latency: 0.007987, mean: 0.008025, 99th_p: 0.010825, std_dev: 0.000874
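The throughput figure is computed in printStats as batch_size / latency, so despite the "batches/second" label it counts sentences per second. As a rough cross-check, the sketch below recomputes it from the median latency printed above. The two numbers should land close to each other but need not match exactly, because the median of the per-iteration speeds is not always the reciprocal of the median latency.

```python
batch_size = 4
median_latency_s = 0.007987  # median latency from the output above, in seconds

# Sentences processed per second at the median latency
throughput = batch_size / median_latency_s

# Average time spent per sentence, in milliseconds
per_sentence_ms = median_latency_s / batch_size * 1000

print(f"{throughput:.1f} sentences/second")
print(f"{per_sentence_ms:.3f} ms per sentence")
```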



The model after conversion to TorchScript, traced_mlm_model:

In [23]:
timings = timeGraph(traced_mlm_model.cuda(), enc_inputs['input_ids'], enc_inputs['attention_mask'], enc_inputs['token_type_ids'])

printStats("BERT", timings, batch_size)

Warm up ...
Start timing ...

batch size=4, num iterations=50
  Median text batches/second: 491.6, mean: 492.3
  Median latency: 0.008137, mean: 0.008247, 99th_p: 0.010523, std_dev: 0.001002



The original FP16 model, mlm_model_ts_half:

In [24]:
timings = timeGraph(mlm_model_ts_half.cuda(), enc_inputs['input_ids'], enc_inputs['attention_mask'], enc_inputs['token_type_ids'])

printStats("BERT", timings, batch_size)

Warm up ...
Start timing ...

batch size=4, num iterations=50
  Median text batches/second: 652.1, mean: 627.2
  Median latency: 0.006134, mean: 0.006622, 99th_p: 0.010615, std_dev: 0.001427



The FP16 model after conversion to TorchScript, traced_mlm_model_half:

In [25]:
timings = timeGraph(traced_mlm_model_half.cuda(), enc_inputs['input_ids'], enc_inputs['attention_mask'], enc_inputs['token_type_ids'])

printStats("BERT", timings, batch_size)

Warm up ...
Start timing ...

batch size=4, num iterations=50
  Median text batches/second: 769.4, mean: 768.3
  Median latency: 0.005199, mean: 0.005520, 99th_p: 0.009661, std_dev: 0.001472



Now for the main event: the TensorRT-compiled model, trt_model.

In [28]:
timings = timeGraph(trt_model, enc_inputs['input_ids'], enc_inputs['attention_mask'], enc_inputs['token_type_ids'])

printStats("BERT", timings, batch_size)

Warm up ...
Start timing ...

batch size=4, num iterations=50
  Median text batches/second: 586.4, mean: 602.8
  Median latency: 0.006821, mean: 0.006729, 99th_p: 0.008102, std_dev: 0.000779



The TensorRT-compiled FP16 model, trt_model_fp16:

In [32]:
timings = timeGraph(trt_model_fp16, enc_inputs['input_ids'], enc_inputs['attention_mask'], enc_inputs['token_type_ids'])

printStats("BERT", timings, batch_size)

Warm up ...
Start timing ...

batch size=4, num iterations=50
  Median text batches/second: 1141.7, mean: 1087.3
  Median latency: 0.003504, mean: 0.003793, 99th_p: 0.005308, std_dev: 0.000689



# Summary

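Compared with the numbers in the [original notebook](https://github.com/pytorch/TensorRT/blob/master/notebooks/Hugging-Face-BERT.ipynb), the speedup on my machine is not very pronounced.
With FP32 in particular, throughput only rose from about 500 to about 600, a gain of only around 20%.
The original also shows a large gain from the traced model, but I saw none here: throughput went from about 500 to about 490, which is slightly lower.
FP16 gave a clearer improvement, ending above 1000 (a median of about 1140).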

The original reports Scripted (GPU): 1.0x, Traced (GPU): 1.62x, Torch-TensorRT (FP32): 2.14x, Torch-TensorRT (FP16): 3.15x.
Its device was an NVIDIA A100 GPU.

My device is a 3090. The code was run in the same image, nvcr.io/nvidia/pytorch:22.05-py3.
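To put the six runs side by side, the sketch below expresses each median throughput relative to the unmodified FP32 model, using the figures printed in the timing section. The labels, including "eager" for the untraced models, are my own shorthand.

```python
# Median throughput from the timing outputs in this notebook
median_throughput = {
    "eager FP32 (mlm_model_ts)": 500.8,
    "traced FP32 (traced_mlm_model)": 491.6,
    "eager FP16 (mlm_model_ts_half)": 652.1,
    "traced FP16 (traced_mlm_model_half)": 769.4,
    "TensorRT FP32 (trt_model)": 586.4,
    "TensorRT FP16 (trt_model_fp16)": 1141.7,
}

# Use the unmodified FP32 model as the 1.0x baseline
baseline = median_throughput["eager FP32 (mlm_model_ts)"]
speedups = {name: value / baseline for name, value in median_throughput.items()}

for name, speedup in speedups.items():
    print(f"{name:38s} {speedup:.2f}x")
```

The original notebook's ratios are relative to its own "Scripted (GPU)" baseline, so the two sets of ratios are only loosely comparable.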