<a href="https://colab.research.google.com/github/rahiakela/small-language-models-fine-tuning/blob/main/domain-specific-small-language-models/08-model-profiling/01_profiling_linear_regression_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Profiling Linear Regression ONNX Models

This notebook profiles a Linear Regression model after conversion to the [ONNX](https://onnx.ai/) format and optimization, and turns the raw profiling data into performance insights. The same code applies to other models, including LLMs, and the insight-building part is generic for profiling any ML/DL ONNX model. No hardware acceleration is needed.

Install the missing dependencies in the Colab VM: only ONNX and ONNX Runtime, plus skl2onnx for the conversion and mlprodict (used only for profiling data aggregation and clean-up). See the note later in this notebook about installing the mlprodict package on recent versions of the Colab runtime.

In [None]:
!pip install onnx onnxruntime

In [None]:
!pip install skl2onnx

In [25]:
!pip install mlprodict

Collecting mlprodict
  Using cached mlprodict-0.9.1883.tar.gz (814 kB)
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> See above for output.
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  Preparing metadata (setup.py) ... error
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.


Import the required packages and classes.

In [21]:
import logging
import numpy as np
import torch
import json

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

from skl2onnx import to_onnx
from onnxruntime import InferenceSession, SessionOptions

# mlprodict does not install on recent Colab runtimes (see the failed install above),
# so OnnxWholeSession is not imported here; its source is reproduced later in this notebook.
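
Since an optional package such as `mlprodict` may be missing from the runtime, the import cell can be made tolerant of that. The following is a minimal sketch, not part of the original notebook: the helper name `is_installed` is hypothetical, and it relies only on the standard library's `importlib.util.find_spec`.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found in this environment."""
    try:
        return importlib.util.find_spec(package_name) is not None
    except (ImportError, ValueError):
        # find_spec can raise for malformed or partially initialised modules
        return False


if is_installed("mlprodict"):
    from mlprodict.onnxrt.ops_whole.session import OnnxWholeSession
else:
    # Fall back to the copy of the class pasted later in the notebook
    print("mlprodict is not available; using the local OnnxWholeSession copy")
```

Checking first keeps the rest of the import cell from being aborted by a single missing optional dependency.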

Let's load the diabetes dataset and split it into training and test sets.

In [3]:
data = load_diabetes()

In [6]:
X, y = data.data, data.target

X_train, X_test, y_train, _ = train_test_split(X, y, random_state=42)

In [7]:
print(f"Training Shape: {X_train.shape}, Testing Shape: {X_test.shape}")

Training Shape: (331, 10), Testing Shape: (111, 10)
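
The two shapes follow from the default 25% test split applied to the 442 samples reported above. A short worked check, assuming the scikit-learn default `test_size=0.25` (the test count is rounded up and the training set takes the remainder):

```python
import math

n_samples = 442   # 331 + 111 rows, as printed above
test_size = 0.25  # default used when no size is passed to train_test_split

n_test = math.ceil(test_size * n_samples)  # 110.5 is rounded up
n_train = n_samples - n_test               # the training set takes the rest

print(n_train, n_test)
```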


We can now train the model with the standard scikit-learn API.

In [13]:
clr = LinearRegression()
clr.fit(X_train, y_train)

Export the model to ONNX.

In [14]:
model_def = to_onnx(clr, X_train)

In [15]:
model_def

ModelProto(ir_version=10, opset_import={'': 13}, domain='ai.onnx', producer_name='skl2onnx', producer_version='1.19.1', graph=GraphProto('ONNX(LinearRegression)', input=<1 inputs>, output=<1 outputs>, initializer=<3 initializers>, node=<3 nodes>))

This converted model can be profiled.

In [17]:
so = SessionOptions()
so.enable_profiling = True

sess = InferenceSession(model_def.SerializeToString(), so)

We now run inference on the whole test set (111 samples) and stop the
profiling at the end:

In [18]:
sess.run(None, {"X": X_test})

prof = sess.end_profiling()
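
The profiler reports per-operator timings for the single call above. For an end-to-end figure it can help to time repeated calls as well. The helper below is a generic sketch using only the standard library: `time_callable` is a hypothetical name, and the stand-in workload would be replaced by something like `lambda: sess.run(None, {"X": X_test})` on a session created for that purpose.

```python
import statistics
import time


def time_callable(fn, warmup=3, repeats=20):
    """Return (mean_ms, stdev_ms) of wall-clock time over `repeats` calls of fn."""
    for _ in range(warmup):
        fn()  # warm-up calls are not measured
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.mean(samples), statistics.stdev(samples)


# Stand-in workload so the sketch runs on its own
mean_ms, stdev_ms = time_callable(lambda: sum(range(10_000)))
print(f"{mean_ms:.3f} ms ± {stdev_ms:.3f} ms")
```

The warm-up calls keep one-off costs, such as lazy initialisation, out of the reported average.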

Now load the raw profiling data from the JSON file that `end_profiling()` created.



In [20]:
with open(prof, "r") as f:
    js = json.load(f)

and then create a Pandas DataFrame.
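
A minimal sketch of that step, assuming the profile is a list of trace events whose operator details sit in a nested `args` dictionary. The sample events below are made up for illustration and are not real measurements. `pandas.json_normalize` with `sep="_"` produces the same `args_*` column names that the rest of this notebook uses:

```python
import pandas as pd

# Hand-written stand-in for the list loaded from the profiling JSON file
js = [
    {"cat": "Node", "name": "MatMul_kernel_time", "dur": 12,
     "args": {"op_name": "MatMul", "provider": "CPUExecutionProvider"}},
    {"cat": "Node", "name": "Add_kernel_time", "dur": 3,
     "args": {"op_name": "Add", "provider": "CPUExecutionProvider"}},
]

# Flatten the nested "args" dictionaries into args_<key> columns
df = pd.json_normalize(js, sep="_")
print(df[["name", "dur", "args_op_name"]])
```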

# Model Optimization

Set the logging level to INFO so that the output shows which optimizations are applied automatically.

In [None]:
logging.basicConfig()
logging.getLogger().setLevel(logging.INFO)

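Optimize the model using ONNX Runtime's transformer optimizer (`onnxruntime.transformers.optimizer`). This cell targets a GPT-2 ONNX export and expects `onnx_model_path`, `num_attention_heads`, and `hidden_size` to be defined beforehand; none of them is set in this notebook.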

In [None]:
from onnxruntime.transformers import optimizer

onnx_optim_model_path="gpt2onnx-opt.onnx"
optimized_model = optimizer.optimize_model(onnx_model_path,
                                           model_type='gpt2',
                                           num_heads=num_attention_heads,
                                           hidden_size=hidden_size,
                                           use_gpu=False,
                                           opt_level=1,
                                           verbose=True)
optimized_model.convert_float_to_float16()
optimized_model.save_model_to_file(onnx_optim_model_path)

Run the optimized ONNX model once with profiling enabled. The cell relies on `tokenizer`, `get_example_inputs`, and `num_layer`, which are not defined in this notebook.

In [None]:
import onnx
import onnxruntime  # needed for SessionOptions and InferenceSession below

optimized_onnx_model = onnx.load(onnx_optim_model_path)

tokenizer.pad_token = tokenizer.eos_token
input_ids, attention_mask, position_ids, empty_past = get_example_inputs(
    ['Here is some text to encode Hello World'], tokenizer, num_layer)

so = onnxruntime.SessionOptions()
so.enable_profiling = True
session = onnxruntime.InferenceSession(onnx_optim_model_path, so,
                                       providers=["CPUExecutionProvider"])
ort_inputs = {
    "input_ids": np.ascontiguousarray(input_ids.cpu().numpy()),
}
ort_outputs = session.run(None, ort_inputs)
prof_optimized = session.end_profiling()

# Profiling Data Clean Up and Visualization

The original `OnnxWholeSession` class from *mlprodict* is reproduced below, because installing the package fails on the latest version of the Colab runtime.

In [None]:
import json
import numpy

class OnnxWholeSession:
    """
    Runs the prediction for a single :epkg:`ONNX`,
    it lets the runtime handle the graph logic as well.

    :param onnx_data: :epkg:`ONNX` model or data
    :param runtime: runtime to be used, mostly :epkg:`onnxruntime`
    :param runtime_options: runtime options
    :param device: device, a string `cpu`, `cuda`, `cuda:0`...

    .. versionchanged:: 0.8
        Parameter *device* was added.
    """

    def __init__(self, onnx_data, runtime, runtime_options=None, device=None):
        if runtime not in ('onnxruntime1', 'onnxruntime1-cuda'):
            raise NotImplementedError(  # pragma: no cover
                f"runtime '{runtime}' is not implemented.")

        from onnxruntime import (  # delayed
            InferenceSession, SessionOptions, RunOptions,
            GraphOptimizationLevel)
        from onnxruntime.capi._pybind_state import (  # pylint: disable=E0611
            Fail as OrtFail, InvalidGraph as OrtInvalidGraph,
            InvalidArgument as OrtInvalidArgument,
            NotImplemented as OrtNotImplemented,
            RuntimeException as OrtRuntimeException)

        onnx_data0 = onnx_data
        if hasattr(onnx_data, 'SerializeToString'):
            onnx_data = onnx_data.SerializeToString()
        if isinstance(runtime_options, SessionOptions):
            sess_options = runtime_options
            session_options = None
            runtime_options = None
        else:
            session_options = (
                None if runtime_options is None
                else runtime_options.get('session_options', None))
            self.runtime = runtime
            sess_options = session_options or SessionOptions()
        self.run_options = RunOptions()
        self.run_options.log_severity_level = 3
        self.run_options.log_verbosity_level = 1

        if session_options is None:
            if runtime_options is not None:
                if runtime_options.get('disable_optimisation', False):
                    # Disabling optimisation must switch every graph optimization off
                    sess_options.graph_optimization_level = (  # pragma: no cover
                        GraphOptimizationLevel.ORT_DISABLE_ALL)
                if runtime_options.get('enable_profiling', True):
                    sess_options.enable_profiling = True
                if runtime_options.get('log_severity_level', 2) != 2:
                    v = runtime_options.get('log_severity_level', 2)
                    sess_options.log_severity_level = v
                    self.run_options.log_severity_level = v
        elif runtime_options is not None and 'enable_profiling' in runtime_options:
            raise RuntimeError(  # pragma: no cover
                "session_options and enable_profiling cannot be defined at the "
                "same time.")
        elif runtime_options is not None and 'disable_optimisation' in runtime_options:
            raise RuntimeError(  # pragma: no cover
                "session_options and disable_optimisation cannot be defined at the "
                "same time.")
        elif runtime_options is not None and 'log_severity_level' in runtime_options:
            raise RuntimeError(  # pragma: no cover
                "session_options and log_severity_level cannot be defined at the "
                "same time.")
        providers = ['CPUExecutionProvider']
        if runtime == 'onnxruntime1-cuda':
            providers = ['CUDAExecutionProvider'] + providers
        try:
            self.sess = InferenceSession(onnx_data, sess_options=sess_options,
                                         device=device, providers=providers)
        except (OrtFail, OrtNotImplemented, OrtInvalidGraph,
                OrtInvalidArgument, OrtRuntimeException, RuntimeError) as e:
            raise RuntimeError(
                f"Unable to create InferenceSession due to '{e}'.") from e
        self.output_names = [_.name for _ in self.sess.get_outputs()]

    def run(self, inputs):
        """
        Computes the predictions.

        @param      inputs      dictionary *{variable, value}*
        @return                 list of outputs
        """
        v = next(iter(inputs.values()))
        if isinstance(v, (numpy.ndarray, dict)):
            try:
                return self.sess._sess.run(
                    self.output_names, inputs, self.run_options)
            except ValueError as e:
                raise ValueError(
                    "Issue running inference inputs=%r, expected inputs=%r."
                    "" % (
                        list(sorted(inputs)),
                        [i.name for i in self.sess.get_inputs()])) from e
        try:
            return self.sess._sess.run_with_ort_values(
                inputs, self.output_names, self.run_options)
        except RuntimeError:
            return self.sess._sess.run_with_ort_values(
                {k: v._get_c_value() for k, v in inputs.items()},
                self.output_names, self.run_options)

    @staticmethod
    def process_profiling(js):
        """
        Flattens json returned by onnxruntime profiling.

        :param js: json
        :return: list of dictionaries
        """
        rows = []
        for row in js:
            if 'args' in row and isinstance(row['args'], dict):
                for k, v in row['args'].items():
                    row[f'args_{k}'] = v
                del row['args']
            rows.append(row)
        return rows

    def get_profiling(self):
        """
        Returns the profiling information.
        """
        prof = self.sess.end_profiling()
        with open(prof, 'r') as f:
            content = f.read()
        js = json.loads(content)
        return OnnxWholeSession.process_profiling(js)
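
`process_profiling` is the only part of the class that the rest of this notebook uses. To illustrate what it does to a single event, here is a standalone sketch with a made-up event. Unlike the method above, this hypothetical `flatten_event` helper returns a new dictionary instead of modifying its input.

```python
def flatten_event(event):
    """Return a copy of a trace event with its 'args' dict lifted to args_<key> fields."""
    flat = {k: v for k, v in event.items() if k != "args"}
    args = event.get("args")
    if isinstance(args, dict):
        for key, value in args.items():
            flat[f"args_{key}"] = value
    elif "args" in event:
        # Non-dict 'args' values are kept as they are
        flat["args"] = args
    return flat


# Made-up event for illustration, not a real measurement
event = {"name": "MatMul_kernel_time", "dur": 12,
         "args": {"op_name": "MatMul", "provider": "CPUExecutionProvider"}}
print(flatten_event(event))
```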

Define a custom function to put the raw ONNX profiling data in a more friendly and useful format.

In [None]:
import json
import pandas as pd

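def clean_up_profiling_data(prof):
  # Load the raw profiling events from the JSON file
  with open(prof, "r") as f:
    js = json.load(f)
  # Flatten each event's nested "args" dictionary into args_* columns
  df = pd.DataFrame(OnnxWholeSession.process_profiling(js))

  return df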

Define a custom function that aggregates the profiling data for the visualizations. Grouping by operator type, it computes the total duration of each operator (sorted by duration), the number of occurrences of each operator (in the same order), and each operator's share of the total inference time.

In [None]:
def transform_profiling_data_for_visualization(df):
  gr_dur = df[['dur', "args_op_name"]].groupby("args_op_name").sum().sort_values('dur')

  gr_n = df[['dur', "args_op_name"]].groupby("args_op_name").count().sort_values('dur')
  gr_n = gr_n.loc[gr_dur.index, :]

  gr_dur_perc = 100 * gr_dur / gr_dur['dur'].sum()  # share of total time, in percent

  return gr_dur, gr_n, gr_dur_perc
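
To make the three aggregates concrete, the same computation can be written out with the standard library on a handful of made-up events (the operator names and durations are illustrative only). Events without an operator name are skipped, which mirrors how the pandas `groupby` drops missing keys.

```python
from collections import Counter, defaultdict

events = [
    {"dur": 12, "args_op_name": "MatMul"},
    {"dur": 3, "args_op_name": "Add"},
    {"dur": 5, "args_op_name": "Add"},
    {"dur": 40},  # session-level event without an operator name
]

total_by_op = defaultdict(int)
count_by_op = Counter()
for event in events:
    op = event.get("args_op_name")
    if op is None:
        continue
    total_by_op[op] += event["dur"]
    count_by_op[op] += 1

# Share of the summed operator time taken by each operator type
grand_total = sum(total_by_op.values())
share_by_op = {op: dur / grand_total for op, dur in total_by_op.items()}

# Print operators in ascending order of total duration
for op in sorted(total_by_op, key=total_by_op.get):
    print(op, total_by_op[op], count_by_op[op], f"{share_by_op[op]:.0%}")
```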

Transform the profiling data for the ONNX model.

In [None]:
gr_dur, gr_n, gr_dur_perc = transform_profiling_data_for_visualization(clean_up_profiling_data(prof))

Create visualizations for the ONNX model profiling data.

In [None]:
import plotly.express as px

fig = px.bar(gr_dur, x='dur',
             labels={
                     "dur": "Duration (µs)",
                     "args_op_name": "Operation type",
                 },
             title='Duration')
fig.show()

In [None]:
fig = px.bar(gr_n, x='dur',
             labels={
                     "dur": "Op count",
                     "args_op_name": "Operation type",
                 },
             title='Occurrences')
fig.show()

In [None]:
fig = px.bar(gr_dur_perc, x='dur',
             labels={
                     "dur": "Duration (%)",
                     "args_op_name": "Operation type",
                 },
             title='Proportion')
fig.show()

Transform the profiling data for the optimized ONNX model.

In [None]:
gr_dur, gr_n, gr_dur_perc = transform_profiling_data_for_visualization(clean_up_profiling_data(prof_optimized))

Create visualizations for the optimized ONNX model profiling data.

In [None]:
fig = px.bar(gr_dur, x='dur',
             labels={
                     "dur": "Duration (µs)",
                     "args_op_name": "Operation type",
                 },
             title='Duration')
fig.show()

In [None]:
fig = px.bar(gr_n, x='dur',
             labels={
                     "dur": "Op count",
                     "args_op_name": "Operation type",
                 },
             title='Occurrences')
fig.show()

In [None]:
fig = px.bar(gr_dur_perc, x='dur',
             labels={
                     "dur": "Duration (%)",
                     "args_op_name": "Operation type",
                 },
             title='Proportion')
fig.show()
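
The two sets of charts can also be compared numerically. A minimal sketch, assuming the per-operator totals of each run have been collected into plain dictionaries (for instance via `gr_dur['dur'].to_dict()`). The numbers below are placeholders, not measurements, and `compare_runs` is a hypothetical helper.

```python
def compare_runs(baseline, optimized):
    """Return {op: (baseline_dur, optimized_dur, speedup)} for every operator seen."""
    report = {}
    for op in sorted(set(baseline) | set(optimized)):
        before = baseline.get(op, 0)
        after = optimized.get(op, 0)
        # Speedup is undefined when the operator is absent from either run
        speedup = before / after if before and after else None
        report[op] = (before, after, speedup)
    return report


# Placeholder totals per operator type, for illustration only
baseline = {"MatMul": 120, "Add": 30, "Reshape": 10}
optimized = {"MatMul": 60, "Add": 30}

for op, (before, after, speedup) in compare_runs(baseline, optimized).items():
    label = "n/a" if speedup is None else f"{speedup:.2f}x"
    print(f"{op:<8} {before:>5} {after:>5} {label}")
```

Operators can disappear or change name after optimization (for example when nodes are fused), which is why the helper reports an undefined speedup rather than dividing by zero.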