# Jointly prune, quantize and distill a Hugging Face Text Classification Model with OpenVINO

This notebook shows how to jointly prune, quantize and distill a text classification model with OpenVINO's [Neural Network Compression Framework](https://github.com/openvinotoolkit/nncf) (NNCF).

With quantization, we reduce the precision of the model's weights and activations from floating point (FP32) to integer (INT8). This results in a smaller model with faster inference times with OpenVINO Runtime. 
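To make the FP32-to-INT8 mapping concrete, here is a minimal sketch of affine quantization in plain Python. It is an illustration only, not NNCF's implementation: the function names, the min/max range estimate and the sample values are assumptions made for this example.

```python
def quantize_int8(values):
    """Map floats to integers in the int8 range [-128, 127] (illustrative only)."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 or 1.0          # size of one integer step in float units
    zero_point = round(-128 - lo / scale)   # integer that represents (roughly) `lo`
    quantized = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Map int8 values back to approximate floats."""
    return [(q - zero_point) * scale for q in quantized]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]      # hypothetical FP32 weights
q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"max round-trip error: {max_error:.4f}")
```

Each value is stored in one byte instead of four, at the cost of a round-trip error of at most about one quantization step.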

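The notebook demonstrates joint pruning, quantization and distillation (JPQD). Unlike post-training quantization, JPQD compresses the model while fine-tuning it: movement sparsity prunes weights, quantization-aware training prepares the model for INT8, and a larger teacher model guides the student through distillation. The optimized model runs with OpenVINO Runtime; a laptop or desktop with a recent Intel Core processor is recommended for best results. To install the requirements for this notebook, run `pip install -r requirements.txt`.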

In [1]:
import os
import time
import warnings
from pathlib import Path
os.environ['CUDA_VISIBLE_DEVICES'] = '1'

import datasets
import evaluate
import numpy as np
import pandas as pd
import transformers
from evaluate import evaluator
from openvino.runtime import Core
from optimum.intel.openvino import OVModelForSequenceClassification, OVQuantizer, OVTrainer, OVConfig, OVTrainingArguments
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline, AutoConfig

transformers.logging.set_verbosity_error()
datasets.logging.set_verbosity_error()

INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, onnx


## Preview the Dataset

The `datasets` library makes it easy to load datasets. Common datasets can be loaded from the Hugging Face Hub by providing the name of the dataset. See https://github.com/huggingface/datasets. We load the SST-2 task of the GLUE benchmark with `load_dataset`, together with the matching GLUE metric from the `evaluate` library. Every item in the SST-2 dataset has a `sentence`, a binary `label` (0 for negative, 1 for positive) and an index (`idx`).

In [2]:
TASK_NAME = "sst2"
dataset = datasets.load_dataset("glue", TASK_NAME)
metric = evaluate.load('glue', TASK_NAME)

## Settings

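We define `MODEL_ID` (the student model to compress), `TEACHER_MODEL_ID` (a larger model used as the distillation teacher), and the paths where the FP32 and JPQD-optimized models are saved.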

In [3]:
MODEL_ID = "textattack/bert-base-uncased-SST-2"
TEACHER_MODEL_ID = 'TehranNLP-org/bert-large-sst2'
base_model_path = Path(f"/tmp/yujiepan/optimum-notebook/{MODEL_ID}")
fp32_model_path = base_model_path.with_name(base_model_path.name + "_FP32")
jpqd_model_path = base_model_path.with_name(base_model_path.name + "_JPQD")

## Load Model and Tokenizer

We load the student and teacher models from the Hugging Face Hub. A model is downloaded automatically if it has not been downloaded before, and loaded from the cache otherwise. The student model is also saved to `fp32_model_path` so that it can be compared with the optimized model later.

We also load the tokenizer, which converts the sentences from the dataset to tokens in the format the model expects.

In [4]:
labels = dataset['train'].features['label'].names
num_labels = len(labels)
id2label = dict(zip(range(num_labels), labels))
label2id = dict(zip(labels, range(num_labels)))

model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID,
    num_labels=num_labels,
    id2label=id2label,
    label2id=label2id
)
model.save_pretrained(fp32_model_path)
teacher_model = AutoModelForSequenceClassification.from_pretrained(
    TEACHER_MODEL_ID,
    num_labels=num_labels,
    id2label=id2label,
    label2id=label2id
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

In [5]:
from contextlib import contextmanager
from unittest.mock import patch

@contextmanager
def patch_tokenizer():
    _original_call = tokenizer.__class__.__call__

    def _new_call(self, *args, **kwargs):
        kwargs['max_length'] = 128
        kwargs['padding'] = 'max_length'
        kwargs['truncation'] = True
        return _original_call(self, *args, **kwargs)

    with patch('.'.join([_original_call.__module__, _original_call.__qualname__]), _new_call):
        yield


with patch_tokenizer():
    print(tokenizer("Hello world"))


{'input_ids': [101, 7592, 2088, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
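The `patch_tokenizer` helper above temporarily replaces `__call__` on the tokenizer class so that every call pads and truncates to 128 tokens, and it works both as a `with` block and as a decorator. The same pattern can be sketched on a toy class. `ToyTokenizer` and `fixed_length` are hypothetical names for this illustration, and `patch.object` is used here in place of the dotted-path string used above.

```python
from contextlib import contextmanager
from unittest.mock import patch


class ToyTokenizer:
    """Hypothetical stand-in for a tokenizer; each word becomes its length."""

    def __call__(self, text, max_length=None, padding=False):
        ids = [len(word) for word in text.split()]
        if max_length is not None:
            ids = ids[:max_length]                       # truncate
            if padding:
                ids = ids + [0] * (max_length - len(ids))  # pad with zeros
        return ids


@contextmanager
def fixed_length(max_length=8):
    original_call = ToyTokenizer.__call__

    def new_call(self, *args, **kwargs):
        # Force the length settings, whatever the caller passed.
        kwargs["max_length"] = max_length
        kwargs["padding"] = True
        return original_call(self, *args, **kwargs)

    # The original method is restored when the block exits.
    with patch.object(ToyTokenizer, "__call__", new_call):
        yield


tok = ToyTokenizer()
unpatched = tok("hello big world")

with fixed_length():
    patched = tok("hello big world")


@fixed_length(max_length=4)   # context managers built this way also work as decorators
def encode(text):
    return tok(text)


decorated = encode("hello big world")
print(unpatched, patched, decorated)
```

Patching the class rather than the instance matters because Python looks up `__call__` on the type when an object is called.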

## Joint pruning, quantization and distillation

Joint pruning, quantization and distillation (JPQD) is applied while fine-tuning the model. We start from the `AutoModelForSequenceClassification` student model loaded above, describe the pruning and quantization algorithms in an NNCF compression config, and pass the student, the teacher and the config to an `OVTrainer`. Calling `train` then compresses the model during fine-tuning, and `save_model` saves the optimized model.

### Preprocess the Dataset

JPQD trains the model, so we need tokenized training and validation data. The `patch_tokenizer` helper defined above pads or truncates every input to 128 tokens, so all examples have the same shape.

To keep the demonstration short, we later train on the first 512 examples of the training split and evaluate on the first 512 examples of the validation split. For production use, you would train on the full dataset.

In [6]:
task_to_keys = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mnli-mm": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}

@patch_tokenizer()
def preprocess_fn(examples):
    sentence1_key, sentence2_key = task_to_keys[TASK_NAME]
    if sentence2_key is None:
        return tokenizer(examples[sentence1_key])
    return tokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True)

encoded_dataset = dataset.map(preprocess_fn, batched=True)
encoded_dataset['train'][:5]

{'sentence': ['hide new secretions from the parental units ',
  'contains no wit , only labored gags ',
  'that loves its characters and communicates something rather beautiful about human nature ',
  'remains utterly satisfied to remain the same throughout ',
  'on the worst revenge-of-the-nerds clichés the filmmakers could dredge up '],
 'label': [0, 0, 1, 0, 0],
 'idx': [0, 1, 2, 3, 4],
 'input_ids': [[101,
   5342,
   2047,
   3595,
   8496,
   2013,
   1996,
   18643,
   3197,
   102,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
 

### Define the compression config
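The compression config is a list of NNCF algorithm configurations. Here it contains two entries: `movement_sparsity`, which prunes the attention and feed-forward layers in a structured way, and `quantization`, which enables INT8 quantization-aware training. See the [NNCF documentation](https://github.com/openvinotoolkit/nncf) for the available options.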
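The `"mode": "block"` entry with `"sparse_factors": [32, 32]` asks for whole blocks of a weight matrix to be pruned together. The sketch below illustrates that idea on a tiny 4x4 matrix with 2x2 blocks. It is an assumption-laden illustration of block-structured masking in general, not NNCF code; `expand_block_mask` and all values are hypothetical.

```python
def expand_block_mask(block_mask, block_rows, block_cols):
    """Expand one keep/prune bit per block into one bit per weight."""
    full_mask = []
    for mask_row in block_mask:
        # Repeat each bit across the columns of its block...
        expanded_row = [bit for bit in mask_row for _ in range(block_cols)]
        # ...and repeat the resulting row for every row of the block.
        full_mask.extend(list(expanded_row) for _ in range(block_rows))
    return full_mask


# Hypothetical decisions: keep the diagonal blocks, prune the off-diagonal ones.
block_mask = [[1, 0],
              [0, 1]]
full_mask = expand_block_mask(block_mask, block_rows=2, block_cols=2)

# Hypothetical 4x4 weight matrix with values 1..16.
weights = [[float(r * 4 + c + 1) for c in range(4)] for r in range(4)]
pruned = [[w * m for w, m in zip(w_row, m_row)]
          for w_row, m_row in zip(weights, full_mask)]

sparsity = sum(v == 0.0 for row in pruned for v in row) / 16
for row in pruned:
    print(row)
print(f"sparsity: {sparsity:.2f}")
```

Pruning in blocks, rows or columns rather than individual weights is what allows the pruned structure to be physically removed from the model later.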

In [7]:
compression_config = [
    {
        "algorithm": "movement_sparsity",
        "params": {
            "warmup_start_epoch": 1,
            "warmup_end_epoch": 2,
            "importance_regularization_factor": 0.05,
            "enable_structured_masking": True,
        },
        "sparse_structure_by_scopes": [
            {"mode": "block", "sparse_factors": [32, 32], "target_scopes": "{re}.*BertAttention.*"},
            {"mode": "per_dim", "axis": 0, "target_scopes": "{re}.*BertIntermediate.*"},
            {"mode": "per_dim", "axis": 1, "target_scopes": "{re}.*BertOutput.*"},
        ],
        "ignored_scopes": [
            "{re}.*NNCFEmbedding.*",
            "{re}.*LayerNorm.*",
            "{re}.*pooler.*",
            "{re}.*classifier.*"]
    },
    {
        "algorithm": "quantization",
        "preset": "mixed",
        "overflow_fix": "disable",
        "initializer": {
            "range": {
                "num_init_samples": 32,
                "type": "percentile",
                "params":
                {
                    "min_percentile": 0.01,
                    "max_percentile": 99.99
                }
            },
            "batchnorm_adaptation": {
                "num_bn_adaptation_samples": 32
            }
        },
        "scope_overrides": {"activations": {"{re}.*matmul_0": {"mode": "symmetric"}}},
        "ignored_scopes": [
            "{re}.*__add___[0-1]",
            "{re}.*layer_norm_0",
            "{re}.*matmul_1",
            "{re}.*__truediv__*",
        ],
    }
]

ov_config = OVConfig(compression=compression_config)

### Distill the Model with JPQD
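In distillation, the student is trained to match the teacher's output distribution as well as the ground-truth labels. The sketch below shows a generic temperature-scaled distillation loss in plain Python. It is not taken from `OVTrainer`: the temperature, the function names and the logits are assumptions for illustration, and the trainer's actual loss may differ.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's and the student's softened predictions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    cross_entropy = -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))
    return cross_entropy * temperature ** 2


teacher = [-2.0, 3.0]          # hypothetical teacher logits (negative, positive)
close_student = [-1.8, 2.9]    # student that agrees with the teacher
far_student = [2.0, -1.0]      # student that disagrees with the teacher

loss_close = distillation_loss(close_student, teacher)
loss_far = distillation_loss(far_student, teacher)
print(f"close: {loss_close:.3f}, far: {loss_far:.3f}")
```

The loss is smaller when the student's distribution is closer to the teacher's, which is the signal that helps the compressed model recover accuracy.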

In [8]:
batch_size = 64
learning_rate = 2e-5
num_train_epochs = 3

args = OVTrainingArguments(
    jpqd_model_path,
    do_train=True,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=learning_rate,
    optim='adamw_torch',
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_train_epochs,
    push_to_hub=False,
    report_to='none'
)

In [9]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)

Initialize the `OVTrainer`. During initialization, the model is wrapped to support pruning and quantization, so this step can be a bit slow.

In [10]:
trainer = OVTrainer(
    model,
    teacher_model=teacher_model,
    args=args,
    train_dataset=encoded_dataset["train"].select(range(512)),
    eval_dataset=encoded_dataset["validation"].select(range(512)),
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    ov_config=ov_config,
    feature='text-classification'
)

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: idx, sentence. If idx, sentence are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

Train and save the model


In [11]:
trainer.train()
trainer.save_model()

The following columns in the training set don't have a corresponding argument in `NNCFNetwork.forward` and have been ignored: idx, sentence. If idx, sentence are not expected by `NNCFNetwork.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 512
  Num Epochs = 3
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 24
The following columns in the evaluation set don't have a corresponding argument in `NNCFNetwork.forward` and have been ignored: idx, sentence. If idx, sentence are not expected by `NNCFNetwork.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 512
  Batch size = 64
Saving model checkpoint to /tmp/yujiepan/optimum-notebook/textattack/bert-base-uncased-SST-2_JPQD/checkpoint-8
Configuration saved in /tmp/yujiepan/optimum-notebook/textattack/bert-base-uncased-SST-2_JPQD

{'eval_loss': 0.2939246892929077, 'eval_accuracy': 0.912109375, 'eval_runtime': 7.5664, 'eval_samples_per_second': 67.668, 'eval_steps_per_second': 1.057, 'epoch': 1.0}


Model weights saved in /tmp/yujiepan/optimum-notebook/textattack/bert-base-uncased-SST-2_JPQD/checkpoint-8/pytorch_model.bin
tokenizer config file saved in /tmp/yujiepan/optimum-notebook/textattack/bert-base-uncased-SST-2_JPQD/checkpoint-8/tokenizer_config.json
Special tokens file saved in /tmp/yujiepan/optimum-notebook/textattack/bert-base-uncased-SST-2_JPQD/checkpoint-8/special_tokens_map.json


  output = g.op(
  _C._jit_pass_onnx_node_shape_type_inference(
  _C._jit_pass_onnx_graph_shape_type_inference(
  _C._jit_pass_onnx_graph_shape_type_inference(



The following columns in the evaluation set don't have a corresponding argument in `NNCFNetwork.forward` and have been ignored: idx, sentence. If idx, sentence are not expected by `NNCFNetwork.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 512
  Batch size = 64
Saving model checkpoint to /tmp/yujiepan/optimum-notebook/textattack/bert-base-uncased-SST-2_JPQD/checkpoint-16
Configuration saved in /tmp/yujiepan/optimum-notebook/textattack/bert-base-uncased-SST-2_JPQD/checkpoint-16/config.json


{'eval_loss': 14.533432006835938, 'eval_accuracy': 0.75390625, 'eval_runtime': 7.647, 'eval_samples_per_second': 66.955, 'eval_steps_per_second': 1.046, 'epoch': 2.0}


Model weights saved in /tmp/yujiepan/optimum-notebook/textattack/bert-base-uncased-SST-2_JPQD/checkpoint-16/pytorch_model.bin
tokenizer config file saved in /tmp/yujiepan/optimum-notebook/textattack/bert-base-uncased-SST-2_JPQD/checkpoint-16/tokenizer_config.json
Special tokens file saved in /tmp/yujiepan/optimum-notebook/textattack/bert-base-uncased-SST-2_JPQD/checkpoint-16/special_tokens_map.json

The following columns in the evaluation set don't have a corresponding argument in `NNCFNetwork.forward` and have been ignored: idx, sentence. If idx, sentence are not expected by `NNCFNetwork.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 512
  Batch size = 64
Saving model checkpoint to /tmp/yujiepan/optimum-notebook/textattack/bert-base-uncased-SST-2_JPQD/checkpoint-24
Configuration saved in /tmp/yujiepan/optimum-notebook/textattack/bert-base-uncased-SST-2_JPQD/checkpoint-24/config.json


{'eval_loss': 0.261968195438385, 'eval_accuracy': 0.916015625, 'eval_runtime': 7.6717, 'eval_samples_per_second': 66.738, 'eval_steps_per_second': 1.043, 'epoch': 3.0}


Model weights saved in /tmp/yujiepan/optimum-notebook/textattack/bert-base-uncased-SST-2_JPQD/checkpoint-24/pytorch_model.bin
tokenizer config file saved in /tmp/yujiepan/optimum-notebook/textattack/bert-base-uncased-SST-2_JPQD/checkpoint-24/tokenizer_config.json
Special tokens file saved in /tmp/yujiepan/optimum-notebook/textattack/bert-base-uncased-SST-2_JPQD/checkpoint-24/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)


Saving model checkpoint to /tmp/yujiepan/optimum-notebook/textattack/bert-base-uncased-SST-2_JPQD
Configuration saved in /tmp/yujiepan/optimum-notebook/textattack/bert-base-uncased-SST-2_JPQD/config.json


{'train_runtime': 123.8275, 'train_samples_per_second': 12.404, 'train_steps_per_second': 0.194, 'train_loss': 3.4295008977254233, 'epoch': 3.0}


Model weights saved in /tmp/yujiepan/optimum-notebook/textattack/bert-base-uncased-SST-2_JPQD/pytorch_model.bin
tokenizer config file saved in /tmp/yujiepan/optimum-notebook/textattack/bert-base-uncased-SST-2_JPQD/tokenizer_config.json
Special tokens file saved in /tmp/yujiepan/optimum-notebook/textattack/bert-base-uncased-SST-2_JPQD/special_tokens_map.json

## Compare FP32 model with optimized model by JPQD

We compare the accuracy, model size, inference results and latency of the FP32 and JPQD-optimized models.

### Inference Pipeline

Transformers [Pipelines](https://huggingface.co/docs/transformers/main/en/pipeline_tutorial) simplify model inference. A `Pipeline` is created by passing a task, model and tokenizer to the `pipeline` function. Inference is then as simple as calling the pipeline on a sentence, for example `hf_sst2_pipeline('I love you.')`.

We create two pipelines, `hf_sst2_pipeline` and `ov_sst2_pipeline_jpqd`, to compare the FP32 PyTorch model with the JPQD-optimized OpenVINO model. These pipelines are also used to show the accuracy difference and for benchmarking later in this notebook.

In [12]:
optimized_model = OVModelForSequenceClassification.from_pretrained(jpqd_model_path)
original_model = AutoModelForSequenceClassification.from_pretrained(fp32_model_path)
ov_sst2_pipeline_jpqd = pipeline("text-classification", model=optimized_model, tokenizer=tokenizer)
hf_sst2_pipeline = pipeline("text-classification", model=original_model, tokenizer=tokenizer)

loading configuration file /tmp/yujiepan/optimum-notebook/textattack/bert-base-uncased-SST-2_JPQD/config.json
Model config BertConfig {
  "_name_or_path": "/tmp/yujiepan/optimum-notebook/textattack/bert-base-uncased-SST-2_JPQD/config.json",
  "architectures": [
    "NNCFNetwork"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "finetuning_task": "sst-2",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "negative",
    "1": "positive"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "negative": 0,
    "positive": 1
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.26.0",
  "type_vocab_size": 2,
  "use_cache": true,
 

In [13]:
with patch_tokenizer():
    print(ov_sst2_pipeline_jpqd('I love you.'))

[{'label': 'positive', 'score': 0.9995125532150269}]


In [14]:
hf_sst2_pipeline('I love you.')

[{'label': 'positive', 'score': 0.9987971782684326}]

### Accuracy

We compare the metrics of the JPQD-optimized model and the original FP32 model. The [evaluate](https://github.com/huggingface/evaluate) library makes it very easy to evaluate models on a given dataset, with a given metric. For the SST-2 task of GLUE, the accuracy is returned.

To load the optimized model with OpenVINO, we use the `OVModelForSequenceClassification` class. It can be used in the same way as [`AutoModelForSequenceClassification`](https://huggingface.co/docs/transformers/main/model_doc/auto).

The pipelines we created in the previous section are used to perform evaluation.

In [15]:
glue_eval = evaluator("text-classification")
input_column = task_to_keys[TASK_NAME][0]

with patch_tokenizer():
    ov_eval_results = glue_eval.compute(
        model_or_pipeline=ov_sst2_pipeline_jpqd,
        data=dataset['validation'].select(range(512)),
        metric=metric,
        input_column=input_column,
        label_mapping=label2id,
    )
    hf_eval_results = glue_eval.compute(
        model_or_pipeline=hf_sst2_pipeline,
        data=dataset['validation'].select(range(512)),
        metric=metric,
        input_column=input_column,
        label_mapping=label2id,
    )

pd.DataFrame.from_records(
    [hf_eval_results, ov_eval_results],
    index=["FP32", "optimized by JPQD"],
).round(2)


Disabling tokenizer parallelism, we're using DataLoader multithreading already


| | accuracy | total_time_in_seconds | samples_per_second | latency_in_seconds |
|---|---|---|---|---|
| FP32 | 0.93 | 30.05 | 17.04 | 0.06 |
| optimized by JPQD | 0.92 | 10.9 | 46.98 | 0.02 |


### Inference Results

To fully understand the quality of a model, it is useful to look beyond metrics like accuracy and examine model predictions directly. This can give a more complete impression of the model's performance and help identify areas for improvement.


In [16]:
results = []
validation_examples = dataset['validation'].select(range(10))

with patch_tokenizer():
    for item in validation_examples:
        fp32_outputs = hf_sst2_pipeline(item[input_column])[0]
        jpqd_outputs = ov_sst2_pipeline_jpqd(item[input_column])[0]
        results.append([
            item[input_column],
            fp32_outputs['label'], fp32_outputs['score'],
            jpqd_outputs['label'], jpqd_outputs['score']
        ])

pd.set_option("display.max_colwidth", None)
pd.DataFrame(
    results,
    columns=['sample','fp32_label', 'fp32_score', 'jpqd_label', 'jpqd_score']
)

| | sample | fp32_label | fp32_score | jpqd_label | jpqd_score |
|---|---|---|---|---|---|
| 0 | it 's a charming and often affecting journey . | positive | 0.999766 | positive | 0.999717 |
| 1 | unflinchingly bleak and desperate | negative | 0.985671 | negative | 0.995319 |
| 2 | allows us to hope that nolan is poised to embark a major career as a commercial yet inventive filmmaker . | positive | 0.999574 | positive | 0.999692 |
| 3 | the acting , costumes , music , cinematography and sound are all astounding given the production 's austere locales . | positive | 0.996028 | positive | 0.999448 |
| 4 | it 's slow -- very , very slow . | negative | 0.997952 | negative | 0.997852 |
| 5 | although laced with humor and a few fanciful touches , the film is a refreshingly serious look at young women . | positive | 0.999637 | positive | 0.999698 |
| 6 | a sometimes tedious film . | negative | 0.996069 | negative | 0.996758 |
| 7 | or doing last year 's taxes with your ex-wife . | negative | 0.994072 | negative | 0.995128 |
| 8 | you do n't have to know about music to appreciate the film 's easygoing blend of comedy and romance . | positive | 0.99964 | positive | 0.999723 |
| 9 | in exactly 89 minutes , most of which passed as slowly as if i 'd been sitting naked on an igloo , formula 51 sank from quirky to jerky to utter turkey . | negative | 0.997039 | negative | 0.997929 |


### Model Size

The FP32 PyTorch model was saved when we loaded it. We now define a function that reports the size of the PyTorch and OpenVINO models.

In [17]:
def get_model_size(model_folder, framework):
    """
    Return OpenVINO or PyTorch model size in MB.
    Arguments:
        model_folder:
            Directory containing a pytorch_model.bin for a PyTorch model, and an openvino_model.xml/.bin for an OpenVINO model.
        framework:
            Define whether the model is a PyTorch or an OpenVINO model.
    """
    if framework.lower() == "openvino":
        # An OpenVINO model consists of an .xml topology file and a .bin weights file.
        model_path = Path(model_folder) / "openvino_model.xml"
        model_size = model_path.stat().st_size + model_path.with_suffix(".bin").stat().st_size
    elif framework.lower() == "pytorch":
        model_path = Path(model_folder) / "pytorch_model.bin"
        model_size = model_path.stat().st_size
    else:
        raise ValueError(f"Unsupported framework: {framework!r}. Use 'openvino' or 'pytorch'.")
    # Convert bytes to megabytes (1 MB = 1000 * 1000 bytes).
    model_size /= 1000 * 1000
    return model_size

fp32_model_size = get_model_size(fp32_model_path, "pytorch")
int8_model_size = get_model_size(jpqd_model_path, "openvino")
print(f"FP32 model size: {fp32_model_size:.2f} MB")
print(f"INT8 model size: {int8_model_size:.2f} MB")
print(f"INT8 size decrease: {fp32_model_size / int8_model_size:.2f}x")

FP32 model size: 438.01 MB
INT8 model size: 106.14 MB
INT8 size decrease: 4.13x


### Benchmarks

Compare the inference speed of the quantized OpenVINO model with that of the original PyTorch model.

This benchmark provides an estimate of performance, but keep in mind that other programs running on the computer, as well as power management settings, can affect performance.

In [18]:
@patch_tokenizer()
def benchmark(pipe, dataset, num_items=100):
    """
    Benchmark a PyTorch or OpenVINO pipeline. This function runs inference on the first
    `num_items` dataset items and returns the median latency in milliseconds.
    """
    latencies = []
    for item in dataset.select(range(num_items)):
        start_time = time.perf_counter()
        pipe(item[input_column])
        end_time = time.perf_counter()
        latencies.append(end_time - start_time)

    return np.median(latencies) * 1000


original_latency = benchmark(hf_sst2_pipeline, dataset['validation'])
quantized_latency = benchmark(ov_sst2_pipeline_jpqd, dataset['validation'])
cpu_device_name = Core().get_property("CPU", "FULL_DEVICE_NAME")

print(cpu_device_name)
print(f"Latency of original FP32 model: {original_latency:.2f} ms")
print(f"Latency of quantized model: {quantized_latency:.2f} ms")
print(f"Speedup: {(original_latency/quantized_latency):.2f}x")

Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
Latency of original FP32 model: 57.60 ms
Latency of quantized model: 18.07 ms
Speedup: 3.19x
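The benchmark above starts timing at the first call, so one-time costs such as lazy initialization are included in the sample. Because the median is reported, a few slow first calls usually have little effect, but a warm-up phase makes this explicit. The sketch below shows that variant with the standard library only. `median_latency_ms` and `dummy_model` are hypothetical names, and the dummy function stands in for a real pipeline call.

```python
import statistics
import time


def median_latency_ms(fn, inputs, warmup=3):
    """Return the median latency of `fn` over `inputs`, in milliseconds."""
    # Warm-up calls are executed but not timed.
    for item in inputs[:warmup]:
        fn(item)

    latencies = []
    for item in inputs:
        start = time.perf_counter()
        fn(item)
        latencies.append(time.perf_counter() - start)
    return statistics.median(latencies) * 1000


def dummy_model(text):
    """Hypothetical stand-in for a pipeline call."""
    return sum(ord(ch) for ch in text) % 2


sentences = ["a charming journey", "a sometimes tedious film"] * 10
latency = median_latency_ms(dummy_model, sentences)
print(f"median latency: {latency:.4f} ms")
```

The median is less sensitive than the mean to occasional slow calls caused by other programs running on the machine.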


In [40]:
stat_lines = trainer.compression_controller.statistics().to_str().split('\n')
print('\n'.join(stat_lines[:9]))

Statistics of the sparsified model:
+-----------------------------------------+-------+
|            Statistic's name             | Value |
| Sparsity level of the whole model       | 0.045 |
+-----------------------------------------+-------+
| Sparsity level of all sparsified layers | 0.058 |
+-----------------------------------------+-------+
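The two statistics differ because the whole-model figure also counts weights in layers that are excluded from pruning, such as those listed under `ignored_scopes`. The sketch below illustrates that relationship with made-up numbers. It assumes each sparsity level is the number of zero weights divided by the number of weights considered; the layer names and counts are hypothetical and are not taken from this model.

```python
# Hypothetical per-layer weight counts; not measured from the notebook's model.
layers = [
    {"name": "embeddings", "zeros": 0, "total": 200, "sparsified": False},
    {"name": "attention", "zeros": 30, "total": 500, "sparsified": True},
    {"name": "feed_forward", "zeros": 20, "total": 300, "sparsified": True},
]


def sparsity_level(selected_layers):
    """Fraction of zero weights across the selected layers."""
    zeros = sum(layer["zeros"] for layer in selected_layers)
    total = sum(layer["total"] for layer in selected_layers)
    return zeros / total


whole_model = sparsity_level(layers)
sparsified_only = sparsity_level([layer for layer in layers if layer["sparsified"]])
print(f"whole model: {whole_model:.3f}, sparsified layers: {sparsified_only:.3f}")
```

The sparsity levels reported above are low because this demonstration trains for only 3 epochs on 512 examples.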

