<a href="https://colab.research.google.com/github/starminalush/mlops_report/blob/main/Copy_of_onnx.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Install the required dependencies

In [1]:
!pip install onnx transformers onnxruntime folium==0.2.1 optimum[onnxruntime]

Collecting onnx
  Downloading onnx-1.11.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (12.8 MB)
Collecting transformers
  Downloading transformers-4.18.0-py3-none-any.whl (4.0 MB)
Collecting onnxruntime
  Downloading onnxruntime-1.11.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.2 MB)
Collecting folium==0.2.1
  Downloading folium-0.2.1.tar.gz (69 kB)
Collecting optimum[onnxruntime]
  Downloading optimum-1.1.0.tar.gz (62 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting huggingface-hub<1.0,>=0.1.0
  Downlo

Pin the library versions

In [2]:
!pip freeze > req.txt

Imports

In [51]:
import torch
from transformers import AutoModelForSequenceClassification
from transformers import BertTokenizerFast
from transformers.onnx import export
from pathlib import Path
from collections import OrderedDict
from typing import Mapping
from transformers.onnx import OnnxConfig
from transformers import AutoConfig
import onnxruntime as nxrun
import onnx
import numpy as np
from optimum.onnxruntime.configuration import AutoQuantizationConfig
from torch.nn.utils import prune
from optimum.onnxruntime import ORTQuantizer

Run RuBERT as is

In [35]:
tokenizer = BertTokenizerFast.from_pretrained('blanchefort/rubert-base-cased-sentiment')
model = AutoModelForSequenceClassification.from_pretrained('blanchefort/rubert-base-cased-sentiment', return_dict=True)

@torch.no_grad()
def predict(text):
    inputs = tokenizer(text, max_length=512, padding=True, truncation=True, return_tensors='pt')
    outputs = model(**inputs)
    predicted = torch.nn.functional.softmax(outputs.logits, dim=1)
    predicted = torch.argmax(predicted, dim=1).numpy()
    return predicted[0]

In [None]:
%%time
predict('Правительство выделит 16 миллиардов рублей на поддержку клещей')
# the neutral class was returned

CPU times: user 128 ms, sys: 46 µs, total: 128 ms
Wall time: 142 ms


0
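
Softmax is monotonic, so the argmax of the probabilities is the same as the argmax of the raw logits; the softmax in `predict` matters only when the class probabilities themselves are needed. A minimal pure-Python sketch of this, using made-up logits (the values are illustrative, not outputs of the model, and the helper names are hypothetical):

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Return the index of the largest element."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical logits for three classes (illustrative values only)
example_logits = [2.3, -0.7, 0.4]
example_probs = softmax(example_logits)

# Both calls select the same class index
print(argmax(example_logits), argmax(example_probs))
```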

Convert to ONNX
1. The transformers library supports ONNX export out of the box

https://huggingface.co/docs/transformers/serialization - we follow the official guide

In [None]:
# Despite its name, this config describes BERT-style inputs, including token_type_ids
class DistilBertOnnxConfig(OnnxConfig):
    @property
    def inputs(self) -> Mapping[str, Mapping[int, str]]:
        return OrderedDict(
            [
                ("input_ids", {0: "batch", 1: "sequence"}),
                ("attention_mask", {0: "batch", 1: "sequence"}),
                ("token_type_ids", {0: "batch", 1: "sequence"}),
            ]
        )

In [None]:
config = AutoConfig.from_pretrained("blanchefort/rubert-base-cased-sentiment")
onnx_config_for_seq_clf = DistilBertOnnxConfig(config, task="sequence-classification")
print(onnx_config_for_seq_clf.outputs)

OrderedDict([('logits', {0: 'batch'})])


In [None]:
!mkdir -p output/onnx_transforms

In [None]:
onnx_inputs, onnx_outputs = export(
    tokenizer,
    model,
    onnx_config_for_seq_clf,
    output=Path("output/onnx_transforms/rubert.onnx"),
    opset=11,
)

Run the model with ONNX Runtime and measure inference time

In [None]:
sess_options = nxrun.SessionOptions()
providers = [
    'CPUExecutionProvider'
]

model_ONNX = nxrun.InferenceSession("output/onnx_transforms/rubert.onnx", sess_options, providers)

Inspect the inputs and outputs of the exported model

In [None]:
input_names = model_ONNX.get_inputs()
for input_name in input_names:
  print(input_name.name)

input_ids
attention_mask
token_type_ids


In [None]:
for on in model_ONNX.get_outputs():
  print(on.name)

logits


In [9]:
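def predict_onnx(text):
  inputs = tokenizer(text, max_length=512, padding=True, truncation=True, return_tensors='np')
  # run() returns a list of output arrays; the first holds the logits, shape (batch, num_classes)
  logits = model_ONNX.run(None, dict(inputs))[0]
  # Pick the highest-scoring class for each item in the batch
  predicted = np.argmax(logits, axis=1)
  return predicted[0]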

In [None]:
%%time
predict_onnx('Правительство выделит 16 миллиардов рублей на поддержку клещей')

CPU times: user 56.3 ms, sys: 0 ns, total: 56.3 ms
Wall time: 58.7 ms


0
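
A single `%%time` run can be noisy, and the first call may include one-off warm-up costs. A hedged sketch of a small benchmarking helper built on the standard library only; `benchmark` and its `warmup`/`repeats` parameters are hypothetical names, not part of the notebook:

```python
import statistics
import time


def benchmark(fn, *args, warmup=3, repeats=20):
    """Return the median wall-clock time of fn(*args) in milliseconds."""
    for _ in range(warmup):
        fn(*args)  # warm-up calls are executed but not timed
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)
```

With the functions defined in this notebook it could be called as `benchmark(predict, text)` and `benchmark(predict_onnx, text)` to compare the two backends on the same input.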

Quantization

We again follow the official guide: https://github.com/huggingface/optimum

There are three kinds of quantization: static, dynamic, and quantization-aware training (QAT).

Dynamic quantization needs no calibration data or retraining, so it is the simplest.
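
To make the idea concrete, here is a simplified pure-Python sketch of 8-bit affine quantization of a weight vector. It illustrates the arithmetic only and is not the scheme ONNX Runtime applies internally; `quantize` and `dequantize` are hypothetical helper names and the weights are made up.

```python
def quantize(values, num_bits=8):
    """Map floats to unsigned integers using an affine (scale, zero_point) scheme."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # The representable range must include 0 so that 0.0 maps to an exact integer
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(qmin - lo / scale)
    q = [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Recover approximate floats from the quantized integers."""
    return [(x - zero_point) * scale for x in q]


weights = [-0.52, 0.0, 0.13, 0.98]  # illustrative values, not real model weights
q, scale, zero_point = quantize(weights)
restored = dequantize(q, scale, zero_point)
```

Each restored value differs from the original by at most about one quantization step (`scale`), which is the precision traded for storing one byte per weight instead of four.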

In [12]:
!mkdir -p output/quantization

In [13]:
model_checkpoint = "blanchefort/rubert-base-cased-sentiment"
# The type of quantization to apply
qconfig = AutoQuantizationConfig.arm64(is_static=False, per_channel=False)
quantizer = ORTQuantizer.from_pretrained(model_checkpoint, feature="sequence-classification")

# Quantize the model!
quantizer.export(
    onnx_model_path="output/quantization/rubert.onnx",
    onnx_quantized_model_output_path="output/quantization/rubert-dyn-quantized.onnx",
    quantization_config=qconfig,
)

PosixPath('output/quantization/rubert-dyn-quantized.onnx')
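
Since int8 weights take one byte instead of four, the quantized file is expected to be noticeably smaller than the float32 export. One way to check this, sketched with the standard library; `size_mb` and `report_sizes` are hypothetical helpers, not part of optimum:

```python
from pathlib import Path


def size_mb(path):
    """Return the size of a file in mebibytes."""
    return Path(path).stat().st_size / (1024 * 1024)


def report_sizes(original, quantized):
    """Build a one-line summary comparing two model files."""
    before, after = size_mb(original), size_mb(quantized)
    return "{:.1f} MB -> {:.1f} MB ({:.0f}% of original)".format(
        before, after, 100.0 * after / before
    )
```

For the files written by the cell above, this could be called as `print(report_sizes("output/quantization/rubert.onnx", "output/quantization/rubert-dyn-quantized.onnx"))`.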

Run the dynamically quantized ONNX model and measure inference time

In [14]:
sess_options = nxrun.SessionOptions()
providers = [
    'CPUExecutionProvider'
]

model_ONNX = nxrun.InferenceSession("output/quantization/rubert-dyn-quantized.onnx", sess_options, providers)

In [None]:
%%time
predict_onnx('Правительство выделит 16 миллиардов рублей на поддержку клещей')

CPU times: user 31.3 ms, sys: 4.81 ms, total: 36.1 ms
Wall time: 37.9 ms


0

Model pruning, again following an existing tutorial: https://github.com/Huffon/nlp-various-tutorials/blob/master/pruning-bert.ipynb

In [40]:
model.bert.encoder.layer

ModuleList(
  (0): BertLayer(
    (attention): BertAttention(
      (self): BertSelfAttention(
        (query): Linear(in_features=768, out_features=768, bias=True)
        (key): Linear(in_features=768, out_features=768, bias=True)
        (value): Linear(in_features=768, out_features=768, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (output): BertSelfOutput(
        (dense): Linear(in_features=768, out_features=768, bias=True)
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (intermediate): BertIntermediate(
      (dense): Linear(in_features=768, out_features=3072, bias=True)
      (intermediate_act_fn): GELUActivation()
    )
    (output): BertOutput(
      (dense): Linear(in_features=3072, out_features=768, bias=True)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
  )
  (1): BertLayer

Prune the encoder layers (the attention key, query, and value weights)

In [41]:
final_model = model

parameters_to_prune = ()
for i in range(12):
    parameters_to_prune += (
        (final_model.bert.encoder.layer[i].attention.self.key, 'weight'),
        (final_model.bert.encoder.layer[i].attention.self.query, 'weight'),
        (final_model.bert.encoder.layer[i].attention.self.value, 'weight'),
    )

prune.global_unstructured(
    parameters_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.2,
)
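
`prune.global_unstructured` pools all listed tensors and removes the smallest-magnitude 20% of weights overall, so an individual tensor can end up above or below 20% sparsity. A simplified pure-Python sketch of that selection rule (it does not reproduce PyTorch's mask-based implementation; the toy tensors and helper names are made up):

```python
def global_l1_prune(tensors, amount):
    """Zero the `amount` fraction of smallest-magnitude weights across all tensors."""
    pooled = sorted(abs(w) for t in tensors.values() for w in t)
    n_prune = round(amount * len(pooled))
    if n_prune == 0:
        return {name: list(t) for name, t in tensors.items()}
    # Largest magnitude that still gets pruned; ties at the threshold are pruned too
    threshold = pooled[n_prune - 1]
    return {
        name: [0.0 if abs(w) <= threshold else w for w in t]
        for name, t in tensors.items()
    }


def sparsity(tensor):
    """Percentage of entries that are exactly zero."""
    return 100.0 * sum(1 for w in tensor if w == 0) / len(tensor)


# Toy tensors: "key" holds large weights, "value" holds several small ones
toy = {
    "key": [0.9, -0.8, 0.7, 0.6, -0.5],
    "value": [0.05, -0.02, 0.4, 0.3, -0.01],
}
pruned = global_l1_prune(toy, amount=0.2)
for name, tensor in pruned.items():
    print(name, sparsity(tensor))
```

In this toy case both pruned weights come from `value`, so its sparsity is 40% while `key` stays at 0%, even though the global sparsity is exactly 20%.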

In [43]:
# Per-layer sparsity of the pruned key, query, and value weights
for i, layer in enumerate(final_model.bert.encoder.layer):
    for name in ("key", "query", "value"):
        weight = getattr(layer.attention.self, name).weight
        print(
            "Sparsity in Layer {}-th {} weight: {:.2f}%".format(
                i + 1,
                name,
                100. * float(torch.sum(weight == 0)) / float(weight.nelement()),
            )
        )
    print()

# Global sparsity across all pruned tensors
numerator, denominator = 0, 0
for layer in final_model.bert.encoder.layer:
    for name in ("key", "query", "value"):
        weight = getattr(layer.attention.self, name).weight
        numerator += torch.sum(weight == 0)
        denominator += weight.nelement()

print("Global sparsity: {:.2f}%".format(100. * float(numerator) / float(denominator)))

Sparsity in Layer 1-th key weight: 18.59%
Sparsity in Layer 1-th query weight: 18.67%
Sparsity in Layer 1-th value weight: 26.79%

Sparsity in Layer 2-th key weight: 18.77%
Sparsity in Layer 2-th query weight: 18.33%
Sparsity in Layer 2-th value weight: 25.69%

Sparsity in Layer 3-th key weight: 20.08%
Sparsity in Layer 3-th query weight: 19.58%
Sparsity in Layer 3-th value weight: 23.53%

Sparsity in Layer 4-th key weight: 18.77%
Sparsity in Layer 4-th query weight: 18.49%
Sparsity in Layer 4-th value weight: 24.32%

Sparsity in Layer 5-th key weight: 18.40%
Sparsity in Layer 5-th query weight: 18.36%
Sparsity in Layer 5-th value weight: 23.00%

Sparsity in Layer 6-th key weight: 18.32%
Sparsity in Layer 6-th query weight: 17.84%
Sparsity in Layer 6-th value weight: 21.55%

Sparsity in Layer 7-th key weight: 18.54%
Sparsity in Layer 7-th query weight: 18.06%
Sparsity in Layer 7-th value weight: 22.07%

Sparsity in Layer 8-th key weight: 18.60%
Sparsity in Layer 8-th query weigh

Run prediction on the pruned model

In [44]:
@torch.no_grad()
def predict(text):
    inputs = tokenizer(text, max_length=512, padding=True, truncation=True, return_tensors='pt')
    outputs = final_model(**inputs)
    predicted = torch.nn.functional.softmax(outputs.logits, dim=1)
    predicted = torch.argmax(predicted, dim=1).numpy()
    return predicted[0]

In [45]:
%%time
predict('Правительство выделит 16 миллиардов рублей на поддержку клещей')
# the neutral class was returned

CPU times: user 120 ms, sys: 1.11 ms, total: 121 ms
Wall time: 134 ms


0

In [46]:
!mkdir -p output/pruning_quantization

In [48]:
torch.save(final_model, 'output/pruning_quantization/pruned_bert.pt')