# GPT-2 Text Prediction with OpenVINO

This notebook demonstrates text prediction with OpenVINO. We use the [GPT-2](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) model, which is part of the Generative Pre-trained Transformer (GPT) family. GPT-2 is pre-trained on a large corpus of English text using unsupervised training. The model is available from [HuggingFace](https://huggingface.co/gpt2). GPT-2 displays a broad set of capabilities, including the ability to generate conditional synthetic text samples of unprecedented quality: we can prime the model with an input and have it generate a lengthy continuation.

The following image illustrates the complete demo pipeline used for this scenario:

![image2](https://user-images.githubusercontent.com/91228207/163990722-d2713ede-921e-4594-8b00-8b5c1a4d73b5.jpeg)

The model input is tokenized text, which serves as the initial condition for generation. Logits are then obtained from the model inference result, and the next token is sampled from the most likely candidates using the top-k sampling strategy and appended to the input sequence. The procedure repeats until the end-of-sequence token is produced or the specified maximum length is reached. After that, the token ids are decoded back to text using the tokenizer.
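
The loop described above can be sketched without any model at all. The following is a minimal illustration, not the notebook's implementation: `toy_model`, `toy_generate`, `VOCAB_SIZE` and `EOS_ID` are hypothetical names, the "model" simply returns random logits, and a batch size of 1 is assumed.

```python
import numpy as np

VOCAB_SIZE = 8  # hypothetical tiny vocabulary
EOS_ID = 0      # hypothetical end-of-sequence token id


def toy_model(input_ids, rng):
    """Stand-in for model inference: random logits of shape [batch, seq_len, vocab]."""
    batch, seq_len = input_ids.shape
    return rng.normal(size=(batch, seq_len, VOCAB_SIZE))


def toy_generate(input_ids, max_length, top_k=3, seed=0):
    rng = np.random.default_rng(seed)
    while input_ids.shape[1] < max_length:
        # logits for the last position only
        logits = toy_model(input_ids, rng)[:, -1, :]
        # keep the top_k largest logits, mask the rest with -inf
        kth_largest = np.sort(logits, axis=-1)[:, -top_k][:, None]
        logits = np.where(logits < kth_largest, -np.inf, logits)
        # softmax over the surviving logits
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        # sample the next token (batch size 1 assumed)
        next_token = rng.choice(VOCAB_SIZE, p=probs[0])
        if next_token == EOS_ID:
            break
        input_ids = np.concatenate([input_ids, [[next_token]]], axis=-1)
    return input_ids


prompt = np.array([[5, 3, 7]])
generated = toy_generate(prompt, max_length=10)
print(generated)
```

The rest of the notebook replaces `toy_model` with the compiled GPT-2 model and splits the masking, softmax and sampling steps into separate helper functions.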


## The model


In [1]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
pt_model = GPT2LMHeadModel.from_pretrained('gpt2')

In [2]:
# Choose one of the modes
mode = 'stateful'   # wraps compiled_model in a helper that carries cached key/value pairs over from the previous iteration
#mode = 'stateless' # the original version of the code, which passes the entire input to the model on every iteration without using cached values

## Convert GPT-2 to OpenVINO IR

![conversion_pipeline](https://user-images.githubusercontent.com/29454499/211261803-784d4791-15cb-4aea-8795-0969dfbb8291.png)

To start working with the GPT-2 model in OpenVINO, the model must be converted to the OpenVINO Intermediate Representation (IR) format. The gpt2 model provided by HuggingFace is a PyTorch model, which OpenVINO supports via conversion to ONNX. We use the export capabilities of the HuggingFace transformers library to produce the ONNX model. `transformers.onnx.export` accepts a preprocessing function for input sample generation (the tokenizer in this case), the model instance, the ONNX export configuration, the ONNX opset version for export, and the output path. More information about exporting transformers models to ONNX can be found in the HuggingFace [documentation](https://huggingface.co/docs/transformers/serialization).

While ONNX models are directly supported by OpenVINO Runtime, it can be useful to convert them to IR format to take advantage of OpenVINO optimization tools and features.
The `mo.convert_model` Python function converts the model with the [OpenVINO Model Optimizer](https://docs.openvino.ai/latest/openvino_docs_MO_DG_Python_API.html). It returns an instance of the OpenVINO Model class, which is ready to use in the Python interface. The model can also be serialized to OpenVINO IR format for future execution using `openvino.runtime.serialize`. In this case, the `compress_to_fp16` parameter is enabled to compress the model weights to `FP16` precision. Dynamic input shapes can additionally be bounded to a range (from one token to the maximum length defined in the processing function) to reduce memory consumption.

In [3]:
from pathlib import Path
from openvino.runtime import serialize
from openvino.tools import mo
from transformers.onnx import export, FeaturesManager


# define path for saving onnx model
onnx_path = Path("model/gpt2.onnx")
onnx_path.parent.mkdir(exist_ok=True)

# define path for saving openvino model
model_path = onnx_path.with_suffix(".xml")

if mode == 'stateful':
    # Will generate extra inputs and output for cached key-value pairs
    feature = 'causal-lm-with-past'
else:
    feature = 'causal-lm'

# get the ONNX config class for the selected feature (causal-lm or causal-lm-with-past)
model_kind, model_onnx_config = FeaturesManager.check_supported_model_or_raise(pt_model, feature=feature)

# fill onnx config based on pytorch model config
onnx_config = model_onnx_config(pt_model.config)

# convert model to onnx
onnx_inputs, onnx_outputs = export(tokenizer, pt_model, onnx_config, onnx_config.default_onnx_opset, onnx_path)

# convert model to openvino
ov_model = mo.convert_model(onnx_path, compress_to_fp16=True)

# serialize openvino model
serialize(ov_model, str(model_path))


### Load the model

We can start by building an OpenVINO Core object. Then, read the network architecture and model weights from the `.xml` and `.bin` files, respectively. Finally, we compile the model for the desired device. Since we use the dynamic shapes feature, which is only available on CPU, we must use `CPU` for the device. Dynamic shapes support on GPU is coming soon.

Since the text generation model has a dynamic input shape, you cannot directly switch the device to `GPU` for inference on integrated or discrete Intel GPUs. To run inference on an iGPU or dGPU with this model, you need to resize the model inputs to a fixed size. Then, try running the inference on the `GPU` device.
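
As a rough illustration of what "a fixed size" means for the data, the sketch below pads the token ids and the attention mask up to a constant length. `pad_to_fixed_length` is a hypothetical helper, not part of OpenVINO or transformers, and `50256` is the GPT-2 end-of-sequence token id used here as the padding value. Reshaping the model itself to a static shape is a separate step that this sketch does not cover.

```python
import numpy as np


def pad_to_fixed_length(input_ids, attention_mask, fixed_length, pad_token_id):
    """Right-pad ids with pad_token_id and the mask with zeros up to fixed_length."""
    batch, cur_len = input_ids.shape
    if cur_len > fixed_length:
        raise ValueError(f"sequence length {cur_len} exceeds fixed length {fixed_length}")
    pad_len = fixed_length - cur_len
    id_padding = np.full((batch, pad_len), pad_token_id, dtype=input_ids.dtype)
    mask_padding = np.zeros((batch, pad_len), dtype=attention_mask.dtype)
    padded_ids = np.concatenate([input_ids, id_padding], axis=-1)
    padded_mask = np.concatenate([attention_mask, mask_padding], axis=-1)
    return padded_ids, padded_mask


ids = np.array([[10, 11, 12]])
mask = np.ones_like(ids)
padded_ids, padded_mask = pad_to_fixed_length(ids, mask, fixed_length=6, pad_token_id=50256)
print(padded_ids)   # three real tokens followed by three padding ids
print(padded_mask)  # ones for real tokens, zeros for padding
```

With padded input, the logits for the next token sit at the position of the last real token (index 2 in this example), not at the last position of the padded sequence.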

In [4]:
import openvino.runtime as ov

In [5]:
from openvino.runtime import Core

# initialize openvino core
core = Core()

# read the model and corresponding weights from file
model = core.read_model(model_path)

# compile the model for CPU devices
stateless_compiled_model = core.compile_model(model=model, device_name="CPU")

# get output tensors
output_key = stateless_compiled_model.output(0)  # TODO: are the logits always the 0th output? Selecting the output by the name 'logits' may be more robust

In [6]:
import time

# Wrap compiled_model in a helper structure that handles the state
class StatefulModel:
    def __init__(self, compiled_model):
        self.compiled_model = compiled_model
        self.build_pairs()
        self.reset()
        
    def build_pairs(self):
        self.state_pairs = []
        # Identify input and output pairs that carry state based on names
        # The code can be simplified if it is guaranteed that the indices of them are the same in the inputs and outputs
        for input in self.compiled_model.inputs:
            input_prefix = 'past_key_values.'
            output_prefix = 'present.'
            if input.any_name.startswith(input_prefix):
                self.state_pairs.append((input.any_name, self.compiled_model.output(output_prefix + input.any_name[len(input_prefix):])))
        
    def __call__(self, kwargs):
        p1 = time.perf_counter()
        if self.state is None:
            # Populate state with empty tensors for the first iteration
            # Need to know where the batch and sequence dimensions are, because we need to allocate tensors with correct batch dimension set
            for input_name, output in self.state_pairs:
                shape = self.compiled_model.input(input_name).get_partial_shape()
                shape[0] = kwargs['input_ids'].shape[0]  # batch dimension
                shape[2] = 0 # sequence dimension
                kwargs[input_name] = ov.Tensor(self.compiled_model.input(input_name).get_element_type(), shape.get_shape())
        else:
            kwargs.update(self.state)
        p2 = time.perf_counter()
        # TODO: use async infer request with shared tensors between inputs and outputs
        # But even with this naively used callable it gives performance improvements
        outputs = self.compiled_model(kwargs)
        p3 = time.perf_counter()
        first_time = self.state is None
        self.state = {}
        for input_name, output in self.state_pairs:
            self.state[input_name] = outputs[output]
        p4 = time.perf_counter()
        if first_time:
            print(f'perf p1..p2: {p2-p1}, p2..p3: {p3-p2}, p3..p4: {p4-p3}, ')
        return outputs
        
    # reset state to be ready for a new sequence
    def reset(self):
        self.state = None
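
To make the state handling more concrete, the sketch below mimics how one cached key (or value) tensor grows between iterations. It uses plain NumPy placeholders instead of real model outputs. The layout `[batch, heads, sequence, head_dim]` follows the wrapper above, which treats dimension 0 as batch and dimension 2 as sequence; the values 12 heads and 64 channels per head are those of the base GPT-2 model and are an assumption about the exported model.

```python
import numpy as np

batch, num_heads, head_dim = 1, 12, 64  # assumed base GPT-2 dimensions

# first iteration: the cache is empty along the sequence dimension
cache = np.zeros((batch, num_heads, 0, head_dim), dtype=np.float32)

# the prompt (13 tokens) is processed first, then one new token per iteration
for new_tokens in (13, 1, 1):
    # placeholder for the keys computed for the tokens fed in this iteration
    new_keys = np.zeros((batch, num_heads, new_tokens, head_dim), dtype=np.float32)
    # the "present" output corresponds to the past cache extended by the new keys
    cache = np.concatenate([cache, new_keys], axis=2)
    print(cache.shape)
```

Because the cache already covers all previous tokens, only the newest token has to be passed as `input_ids` after the first iteration, while the attention mask still spans the whole sequence.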

In [7]:
if mode == 'stateful':
    compiled_model = StatefulModel(stateless_compiled_model)
else:
    compiled_model = stateless_compiled_model

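Input keys are the names of the input nodes, and output keys are the names of the output nodes of the network. For GPT-2, the `input_ids` input has shape [`batch size`, `sequence length`] and the logits output has shape [`batch size`, `sequence length`, `vocab size`]. In `stateful` mode, the model additionally has `past_key_values.*` inputs and `present.*` outputs that carry the cached key/value pairs.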
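
The shapes can be illustrated with placeholder arrays. This is only a sketch: the arrays are filled with zeros rather than real model data, and `50257` is the GPT-2 vocabulary size.

```python
import numpy as np

batch_size, sequence_length, vocab_size = 1, 13, 50257

# model inputs: one token id and one mask value per position
input_ids = np.zeros((batch_size, sequence_length), dtype=np.int64)
attention_mask = np.ones_like(input_ids)

# placeholder for the model output: one score per vocabulary entry and position
logits = np.zeros((batch_size, sequence_length, vocab_size), dtype=np.float32)

# only the scores at the last position are needed to pick the next token
next_token_logits = logits[:, -1, :]
print(input_ids.shape, logits.shape, next_token_logits.shape)
```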

## Pre-Processing

NLP models usually take a list of tokens as input. A token is a word or a piece of a word mapped to an integer. The mapping is defined by a vocabulary, which the GPT-2 tokenizer loaded at the beginning of the notebook already contains.

### Define tokenization

In [8]:
# this function converts text to tokens
def tokenize(text):
    """
    tokenize input text using GPT2 tokenizer
    
    Parameters:
      text, str - input text
    Returns:
      input_ids - np.array with input token ids
      attention_mask - np.array with 1 at the positions of real tokens and 0 at padded positions
    """
    
    inputs = tokenizer(text, return_tensors="np")
    return inputs["input_ids"], inputs["attention_mask"]
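
The idea of mapping text to integer ids and back can be shown with a toy vocabulary. This sketch is deliberately simplified: `toy_vocab`, `toy_tokenize` and `toy_decode` are hypothetical, and the greedy longest-match split used here is not the byte-level BPE algorithm of the real GPT-2 tokenizer.

```python
import numpy as np

# hypothetical vocabulary: token string -> integer id
toy_vocab = {"deep": 0, "learn": 1, "ing": 2, " ": 3, "is": 4, "fun": 5}
id_to_token = {i: t for t, i in toy_vocab.items()}


def toy_tokenize(text):
    """Greedy longest-match tokenization; returns ids and mask shaped [1, seq_len]."""
    ids = []
    pos = 0
    while pos < len(text):
        match = max((t for t in toy_vocab if text.startswith(t, pos)), key=len, default=None)
        if match is None:
            raise ValueError(f"no token matches at position {pos}")
        ids.append(toy_vocab[match])
        pos += len(match)
    input_ids = np.array([ids])
    return input_ids, np.ones_like(input_ids)


def toy_decode(ids):
    """Map ids back to token strings and join them."""
    return "".join(id_to_token[int(i)] for i in ids)


ids, mask = toy_tokenize("deep learning is fun")
print(ids)                 # "learning" is split into "learn" + "ing"
print(toy_decode(ids[0]))  # round-trips to the original text
```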

`eos_token` is a special token that signals the end of generation. We store its index so that it can also be used as the padding value at a later stage.

In [9]:
eos_token_id = tokenizer.eos_token_id

### Define Softmax layer
A softmax function is used to convert top-k logits into a probability distribution.

In [10]:
import numpy as np


def softmax(x):
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    summation = e_x.sum(axis=-1, keepdims=True)
    return e_x / summation
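
A short worked example shows two properties that the generation loop relies on: logits set to `-inf` receive zero probability, and subtracting the row maximum keeps the computation stable for large logits. The function is repeated here only to keep the example self-contained.

```python
import numpy as np


def softmax(x):
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)


# a filtered-out logit (-inf) gets probability 0, the rest sum to 1
probs = softmax(np.array([[2.0, 1.0, -np.inf, 0.0]]))
print(probs)

# large logits do not overflow because the maximum is subtracted first
large = softmax(np.array([[1000.0, 1001.0]]))
print(large)
```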

### Set the minimum sequence length
If the minimum sequence length has not been reached, the following code sets the score of the `eos` token to negative infinity, so that it cannot be sampled and the generation of the next words continues.

In [11]:
def process_logits(cur_length, scores, eos_token_id, min_length=0):
    """
    suppress the end-of-sequence token until the minimum length is reached
    
    Parameters:
      cur_length - current length of input sequence
      scores - model output logits
      eos_token_id - index of end of string token in model vocab
      min_length - minimum length for applying postprocessing
    """
    if cur_length < min_length:
        scores[:, eos_token_id] = -float("inf")
    return scores
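
The effect is easiest to see on a small example with made-up scores. The function is repeated to keep the example self-contained, and the token id `2` is an arbitrary stand-in for the end-of-sequence id. Note that the main loop later in this notebook calls `process_logits` with the default `min_length=0`, in which case the scores are returned unchanged.

```python
import numpy as np


def process_logits(cur_length, scores, eos_token_id, min_length=0):
    if cur_length < min_length:
        scores[:, eos_token_id] = -float("inf")
    return scores


eos = 2  # arbitrary stand-in for the end-of-sequence id
scores = np.array([[0.1, 0.5, 3.0, 0.2]])

# sequence shorter than min_length: the eos score is masked out
too_short = process_logits(cur_length=4, scores=scores.copy(), eos_token_id=eos, min_length=10)
# sequence long enough: scores are left untouched
long_enough = process_logits(cur_length=12, scores=scores.copy(), eos_token_id=eos, min_length=10)
print(too_short)
print(long_enough)
```

The function modifies `scores` in place, which is why the example passes copies.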

### Top-K sampling
In Top-K sampling, we keep only the K most likely next tokens and redistribute the probability mass among them.

In [12]:
def get_top_k_logits(scores, top_k):
    """
    filter logits for top-k sampling: keep the top_k largest values in each row, set the rest to -inf
    
    Parameters:
      scores - model output logits
      top_k - number of elements with highest probability to select
    """
    filter_value = -float("inf")
    top_k = min(max(top_k, 1), scores.shape[-1])
    # sort each row in descending order and keep its top_k largest values
    top_k_scores = -np.sort(-scores)[:, :top_k]
    # compare every row against its own k-th largest value
    indices_to_remove = scores < np.min(top_k_scores, axis=-1, keepdims=True)
    filtered_scores = np.ma.array(scores, mask=indices_to_remove,
                                  fill_value=filter_value).filled()
    return filtered_scores
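
A small worked example with made-up scores illustrates the filtering step. `top_k_filter` is a hypothetical compact variant written for this illustration; it compares each row with its own k-th largest value and uses `np.where` instead of a masked array.

```python
import numpy as np


def top_k_filter(scores, top_k):
    """Keep the top_k largest scores in each row and set the rest to -inf."""
    top_k = min(max(top_k, 1), scores.shape[-1])
    # k-th largest value of every row, kept as a column for broadcasting
    kth_largest = -np.sort(-scores, axis=-1)[:, top_k - 1:top_k]
    return np.where(scores < kth_largest, -np.inf, scores)


scores = np.array([[1.0, 4.0, 2.0, 3.0, 0.5]])
filtered = top_k_filter(scores, top_k=2)
print(filtered)  # only the two largest scores (4.0 and 3.0) survive
```

After the softmax, the masked entries have zero probability, so sampling can only pick one of the K surviving tokens.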

### Main Processing Function
Generate the predicted sequence.

In [13]:
def generate_sequence(input_ids, attention_mask, max_sequence_length=128,
                      eos_token_id=eos_token_id, dynamic_shapes=True):
    """
    text prediction cycle.

    Parameters:
      input_ids: tokenized input ids for model
      attention_mask: attention mask for model
      max_sequence_length: maximum sequence length at which iteration stops
      eos_token_id: end of sequence index from vocab
      dynamic_shapes: use dynamic shapes for inference or pad model input to max_sequence_length
    Returns:
      predicted token ids sequence
    """
    if isinstance(compiled_model, StatefulModel):
        compiled_model.reset()
    model_input_ids = input_ids
    while True:
        cur_input_len = len(input_ids[0])
        if not dynamic_shapes and not isinstance(compiled_model, StatefulModel):
            pad_len = max_sequence_length - cur_input_len
            model_input_ids = np.concatenate((input_ids, [[eos_token_id] * pad_len]), axis=-1)
            model_input_attention_mask = np.concatenate((attention_mask, [[0] * pad_len]), axis=-1)
        else:
            model_input_attention_mask = attention_mask

        start = time.perf_counter()
        outputs = compiled_model({"input_ids": model_input_ids, "attention_mask": model_input_attention_mask})[output_key]
        end = time.perf_counter()
        print(end - start)

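        if not dynamic_shapes and not isinstance(compiled_model, StatefulModel):
            # padded input: the last real token sits at cur_input_len - 1
            next_token_logits = outputs[:, cur_input_len - 1, :]
        else:
            next_token_logits = outputs[:, -1, :]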
        # pre-process distribution
        next_token_scores = process_logits(cur_input_len,
                                           next_token_logits, eos_token_id)
        top_k = 20
        next_token_scores = get_top_k_logits(next_token_scores, top_k)
        # get next token id
        probs = softmax(next_token_scores)
        next_tokens = np.random.choice(probs.shape[-1], 1,
                                       p=probs[0], replace=True)
        # break the loop if max length or end of text token is reached
        if cur_input_len == max_sequence_length or next_tokens == eos_token_id:
            break
        else:
            input_ids = np.concatenate((input_ids, [next_tokens]), axis=-1)
            attention_mask = np.concatenate((attention_mask, [[1] * len(next_tokens)]), axis=-1)
            if isinstance(compiled_model, StatefulModel):
                model_input_ids = np.array([next_tokens])
            else:
                model_input_ids = input_ids
    return input_ids

## Run
The `text` variable below is the input used to generate a predicted sequence.

In [16]:
import time
text = "Deep learning is a type of machine learning that uses neural networks"
input_ids, attention_mask = tokenize(text)

start = time.perf_counter()
output_ids = generate_sequence(input_ids, attention_mask)
end = time.perf_counter()
# Convert the generated ids back to text
output_text = tokenizer.decode(output_ids[0])
print(f"Generation took {end - start:.3f} s")
print("Input Text: ", text)
print()
print(f"Predicted Sequence: {output_text}")

perf p1..p2: 0.00022553419694304466, p2..p3: 0.0276347859762609, p3..p4: 6.847991608083248e-05, 
0.028029551962390542
0.018220022087916732
0.01706844987347722
0.016715022968128324
0.01680465997196734
0.01668190094642341
0.01675644190981984
0.016724367858842015
0.016768553061410785
0.016993527067825198
0.016904365038499236
0.017442172160372138
0.017024751054123044
0.01684852479957044
0.016935155959799886
0.016917634988203645
0.017213913146406412
0.017006686190143228
0.017284360947087407
0.017151687061414123
0.017061993945389986
0.01705497782677412
0.017383768921718
0.017145529156550765
0.017349696019664407
0.017131561879068613
0.017157042864710093
0.017222587950527668
0.01718850084580481
0.0172158379573375
0.0171463789884001
0.01733879093080759
0.017277144128456712
0.017578825121745467
0.01736408704891801
0.017358443001285195
0.01736928103491664
0.01738218078389764
0.01739831198938191
0.01743061002343893
0.017398732947185636
0.017438181908801198
0.017762296833097935
0.01847860193811357
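
The per-step timings printed above are easier to compare when condensed into a few numbers. The sketch below is one possible way to do that. `summarize_latencies` is a hypothetical helper, and the sample list contains only the first four values from the output above.

```python
from statistics import mean


def summarize_latencies(step_seconds):
    """Summarize per-step inference times given in seconds."""
    if not step_seconds:
        raise ValueError("at least one measurement is required")
    first = step_seconds[0]
    rest = step_seconds[1:]
    return {
        "first_step_s": first,
        # average over all steps after the first one
        "mean_next_step_s": mean(rest) if rest else first,
        # generated tokens per second over all measured steps
        "tokens_per_second": len(step_seconds) / sum(step_seconds),
    }


sample = [0.028029551962390542, 0.018220022087916732,
          0.01706844987347722, 0.016715022968128324]
summary = summarize_latencies(sample)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

In `stateful` mode the first step processes the whole prompt while later steps process a single token, so it is useful to report the first step separately from the rest.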
