# Accelerate inference of sparse Transformer models with OpenVINO™ and 4th Gen Intel® Xeon® Scalable processors
This tutorial demonstrates how to improve the performance of sparse Transformer models with [OpenVINO](https://docs.openvino.ai/) on 4th Gen Intel® Xeon® Scalable processors. It uses a pre-trained model from the [HuggingFace Transformers](https://huggingface.co/transformers/) library and shows how to convert it to the OpenVINO™ IR format and run inference on the CPU using a dedicated runtime option that enables sparsity optimizations. It also demonstrates how to gain additional performance by stacking sparsity with 8-bit quantization. To simplify the user experience, the [HuggingFace Optimum](https://huggingface.co/docs/optimum) library is used to convert the model to the OpenVINO™ IR format and quantize it using the [Neural Network Compression Framework](https://github.com/openvinotoolkit/nncf) (NNCF). The tutorial consists of the following steps:

- Install prerequisites.
- Download a sparse BERT model from the public HuggingFace Hub and quantize it using HuggingFace Optimum for OpenVINO.
- Compare sparse 8-bit vs. dense 8-bit inference performance.


## Prerequisites

In [None]:
!pip install optimum[openvino]

In [None]:
try:
    import nncf  # noqa: F401
except ImportError:
    !pip install git+https://github.com/openvinotoolkit/nncf.git#egg=nncf

## Imports

In [None]:
import time
from functools import partial
from pathlib import Path

from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

from optimum.intel.openvino import OVQuantizer
from optimum.intel.openvino import OVConfig

## Quantize model with HuggingFace Optimum API
Sparsity acceleration of MatMul operations is available only when these operations are quantized to 8-bit precision. If the model is not quantized, it can be quantized using any of the methods available for OpenVINO models. For more details, refer to the [model optimization guide](https://docs.openvino.ai/latest/openvino_docs_model_optimization_guide.html). This tutorial uses the HuggingFace Optimum API to quantize the model. It is a high-level API for converting models from the HuggingFace Transformers library to the OpenVINO™ IR format and quantizing them. For more details, refer to the [HuggingFace Optimum documentation](https://huggingface.co/docs/optimum/intel/optimization_ov).
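
As a rough illustration of why sparsity and 8-bit quantization can be stacked, the sketch below applies a simplified symmetric per-tensor int8 quantization to a toy weight row. This is an assumption made for illustration only: the scheme NNCF actually applies may differ, and the helper names `quantize_symmetric_int8` and `sparsity` as well as the weight values are hypothetical. With a symmetric scale, a zero weight always maps to the integer zero, so the sparsity pattern survives quantization.

```python
def quantize_symmetric_int8(weights):
    """Quantize a list of floats to int8 using one symmetric scale (simplified)."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any scale works.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def sparsity(values):
    """Return the fraction of entries that are exactly zero."""
    return sum(1 for v in values if v == 0) / len(values)


# Toy weight row with 80% zeros (illustrative values, not taken from the model).
weights = [0.0, 0.0, 0.42, 0.0, 0.0, 0.0, -0.17, 0.0, 0.0, 0.0]
q_weights, scale = quantize_symmetric_int8(weights)

print("float sparsity:", sparsity(weights))
print("int8 sparsity: ", sparsity(q_weights))
```

Very small non-zero weights may also round to zero, so under this simplified scheme the sparsity rate can only stay the same or grow.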

In [None]:
model_id = "yujiepan/bert-base-uncased-sst2-unstructured-sparsity-80"
quantized_sparse_dir = Path("bert_80_sparse_quantized")

# Instantiate model and tokenizer in PyTorch and load them from the HF Hub
torch_model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

def preprocess_function(examples, tokenizer):
    """
    Tokenize the data and return it in the format expected by the model.

    :param examples: a dictionary containing the input data, which are items from the calibration dataset.
    :param tokenizer: a tokenizer object that is used to tokenize the text data.
    :returns: the data that can be fed directly to the model.
    """
    
    return tokenizer(
        examples["sentence"], padding="max_length", max_length=128, truncation=True
    )

# Create a quantization config (default) and an OVQuantizer.
# OVConfig is a wrapper class on top of the NNCF config.
# Use the "compression" field to control quantization parameters.
# For more information about the parameters, refer to the NNCF GitHub documentation.
quantization_config = OVConfig()
quantizer = OVQuantizer.from_pretrained(torch_model, feature="sequence-classification")

# Load the dataset and build a calibration dataset from it using the HF API.
# The calibration dataset yields tokenized samples that serve as model inputs.
dataset = load_dataset("glue", "sst2")
calibration_dataset = quantizer.get_calibration_dataset(
    "glue",
    dataset_config_name="sst2",
    preprocess_function=partial(preprocess_function, tokenizer=tokenizer),
    num_samples=100,
    dataset_split="train",
)
# Apply static quantization and export the resulting quantized model to OpenVINO IR format
quantizer.quantize(
    quantization_config=quantization_config, calibration_dataset=calibration_dataset, save_directory=quantized_sparse_dir
)

## Benchmark quantized dense inference performance
Benchmark dense inference performance using parallel execution on four CPU cores to simulate a small cloud instance. The sequence length is set to 16, which is common for multiple use cases, e.g. conversational AI.

In [None]:
# Dump benchmarking config for dense inference
with open("perf_config.json", "w") as outfile:
    outfile.write(
"""
{
    "CPU": {"NUM_STREAMS": 4, "INFERENCE_NUM_THREADS": 4}
}
""")

In [None]:
!benchmark_app -m $quantized_sparse_dir/openvino_model.xml -shape "input_ids[1,16],attention_mask[1,16],token_type_ids[1,16]" -load_config perf_config.json
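
The config file above is plain JSON, so it can also be generated from a dictionary instead of a hand-written string. The following is a minimal sketch; `write_perf_config` is a hypothetical helper introduced here for illustration, and it assumes the same `{"CPU": {...}}` layout as the cell above.

```python
import json
from pathlib import Path


def write_perf_config(path, num_streams, num_threads, **extra_cpu_options):
    """Write a benchmarking config file with options for the CPU device."""
    cpu_options = {
        "NUM_STREAMS": num_streams,
        "INFERENCE_NUM_THREADS": num_threads,
    }
    # Any additional CPU properties are passed through unchanged.
    cpu_options.update(extra_cpu_options)
    Path(path).write_text(json.dumps({"CPU": cpu_options}, indent=4))
    return cpu_options


# Produces the same content as the hand-written dense config.
write_perf_config("perf_config.json", num_streams=4, num_threads=4)
```

Building the file with `json.dumps` avoids syntax slips such as a missing comma, and extra CPU properties can be passed as keyword arguments.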

## Benchmark quantized sparse inference performance

In [None]:
# Dump benchmarking config for sparse inference
# "CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE" sets the minimum sparsity rate that weights must have
# to be considered for sparse optimization at runtime.
with open("perf_config_sparse.json", "w") as outfile:
    outfile.write(
"""
{
    "CPU": {"NUM_STREAMS": 4, "INFERENCE_NUM_THREADS": 4, "CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE": 0.8}
}
""")

In [None]:
!benchmark_app -m $quantized_sparse_dir/openvino_model.xml -shape "input_ids[1,16],attention_mask[1,16],token_type_ids[1,16]" -load_config perf_config_sparse.json
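
To make the meaning of the sparsity threshold more concrete, the sketch below checks which weight tensors would meet a minimum sparsity rate of 0.8. It only illustrates the idea of a threshold and is not the CPU plugin's implementation: the layer names, the weight values, and the inclusive `>=` comparison are all assumptions.

```python
def sparsity_rate(weights):
    """Fraction of exactly-zero entries in a flat list of weights."""
    return sum(1 for w in weights if w == 0) / len(weights)


def layers_meeting_threshold(layers, min_rate):
    """Return names of layers whose weight sparsity is at least min_rate."""
    return [name for name, w in layers.items() if sparsity_rate(w) >= min_rate]


# Hypothetical layers with made-up int8 weights, for illustration only.
layers = {
    "dense_layer": [3, -7, 12, 5, -1, 9, 4, -2, 8, 6],  # no zeros
    "half_sparse": [0, -7, 0, 5, 0, 9, 0, -2, 0, 6],    # 50% zeros
    "very_sparse": [0, 0, 0, 5, 0, 0, 0, 0, 0, -3],     # 80% zeros
}

print(layers_meeting_threshold(layers, min_rate=0.8))
```

In this sketch, lowering `min_rate` admits more layers, while raising it restricts the selection to the sparsest ones.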

## When this might be helpful

This feature can improve inference performance for models with sparse weights when the model is deployed to handle multiple requests in parallel asynchronously. It is especially helpful for short sequence lengths, e.g. 32 and lower.

For more details about asynchronous inference with OpenVINO, refer to the following documentation:
- [Deployment Optimization Guide](https://docs.openvino.ai/latest/openvino_docs_deployment_optimization_guide_common.html#doxid-openvino-docs-deployment-optimization-guide-common-1async-api)
- [Inference Request API](https://docs.openvino.ai/latest/openvino_docs_OV_UG_Infer_request.html#doxid-openvino-docs-o-v-u-g-infer-request-1in-out-tensors)
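
The serving pattern described in this section, where several requests are in flight at once, can be sketched with a thread pool from the Python standard library. This is only a stand-in under stated assumptions: `fake_infer` is a hypothetical placeholder that sleeps instead of running a model, and a real deployment would use the asynchronous OpenVINO API described in the documentation linked above.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_infer(request_id):
    """Stand-in for one inference call; sleeps instead of running a model."""
    time.sleep(0.01)
    return request_id


def run_parallel(num_requests, num_workers):
    """Run all requests on a worker pool and return (results, requests per second)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map keeps results in the same order as the submitted requests.
        results = list(pool.map(fake_infer, range(num_requests)))
    elapsed = time.perf_counter() - start
    return results, num_requests / elapsed


results, throughput = run_parallel(num_requests=16, num_workers=4)
print(f"{len(results)} requests, {throughput:.1f} requests/s")
```

The four workers mirror the four streams used in the benchmarking configs. The measured number depends entirely on the placeholder sleep, so it says nothing about real model throughput.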