# Model Optimization Tutorial

This tutorial describes the process of optimizing the user's model. The input to this tutorial is a HAR file in Hailo Model state (before optimization; with native weights) and the output is a quantized HAR file with quantized weights.

Note: For full information about Optimization and Quantization, refer to the `Dataflow Compiler user guide / Model optimization` section.

**Requirements:**

* Run this code in a Jupyter notebook. See the Introduction tutorial for more details.
* The user should first complete the Parsing Tutorial (or have created the HAR file in another way).

**Recommendation:**

* For best performance, run this code on a machine with a GPU. For full information see the `Dataflow Compiler user guide / Model optimization` section.

**Contents:**

* Quick optimization tutorial
* In-depth optimization & evaluation tutorial
* Advanced Model Modifications tutorial
* Compression and Optimization levels

In [1]:
# General imports used throughout the tutorial
# file operations
import json
import os

import numpy as np
import tensorflow as tf
from IPython.display import SVG
from matplotlib import patches
from matplotlib import pyplot as plt
from PIL import Image
from tensorflow.python.eager.context import eager_mode

# import the hailo sdk client relevant classes
from hailo_sdk_client import ClientRunner, InferenceContext

%matplotlib inline

IMAGES_TO_VISUALIZE = 5

## Quick Optimization Tutorial

After the HAR file has been created (using either `runner.translate_tf_model` or `runner.translate_onnx_model`), the next step is to go through the optimization process.

The basic optimization is performed by calling `runner.optimize(calib_dataset)` (or the CLI `hailo optimize` command), as described in the user guide under: Building Models / Model optimization / Model Optimization Workflow.
The calibration dataset should be preprocessed according to the model's input requirements; it is recommended to have at least 1024 inputs and to use a GPU.
During this step it is also possible to use a model script, which changes the default behavior of the Dataflow Compiler, for example, to add an additional layer for normalization.
All the available model script commands are described in the user guide under: Building Models / Model optimization / Optimization Related Model Script Commands.

In order to learn how to deal with common pitfalls, image formats and accuracy, refer to the in-depth section.
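
As a quick sanity check before calling `runner.optimize`, the calibration array can be inspected for its shape, number of entries and invalid values. The sketch below is only an illustration: `check_calib_set` and its arguments are hypothetical and not part of the Dataflow Compiler API, and it assumes an NHWC array layout.

```python
import numpy as np


def check_calib_set(calib, input_hw, channels=3, recommended_entries=1024):
    """Return a list of human-readable warnings about a calibration array (hypothetical helper)."""
    warnings = []
    expected_shape = (input_hw[0], input_hw[1], channels)
    if calib.ndim != 4 or calib.shape[1:] != expected_shape:
        warnings.append(f"expected shape (N, {input_hw[0]}, {input_hw[1]}, {channels}), got {calib.shape}")
    if calib.shape[0] < recommended_entries:
        warnings.append(f"only {calib.shape[0]} entries; at least {recommended_entries} are recommended")
    if not np.isfinite(calib).all():
        warnings.append("array contains NaN or infinite values")
    return warnings


# Usage with a small dummy set (random data stands in for real, preprocessed images)
dummy = np.random.default_rng(0).uniform(0, 255, size=(8, 32, 32, 3)).astype(np.float32)
for message in check_calib_set(dummy, input_hw=(32, 32)):
    print("warning:", message)
```

For a real model, `input_hw` would be the model's input resolution, for example `(640, 640)`.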

In [4]:
# from ultralytics.utils.downloads import download
# # Download labels
# datadir = '../data'
# url = 'https://github.com/ultralytics/assets/releases/download/v0.0.0/coco2017labels.zip'  # labels
# # download(url, dir=datadir)

# url_img = 'http://images.cocodataset.org/zips/val2017.zip'
# download(url_img, dir=datadir)

Downloading http://images.cocodataset.org/zips/val2017.zip to '../data/val2017.zip'...


100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 778M/778M [02:57<00:00, 4.58MB/s]
Unzipping ../data/val2017.zip to /local/workspace/hailo_virtualenv/lib/python3.10/site-packages/hailo_tutorials/data/val2017...: 100%|██████████| 5001/5001 [00:01<00:00, 4708.33file/s]


In [12]:
import torchvision as tv
import torch
# First, we will prepare the calibration set. Resize the images to the correct size and crop them.
def preproc(image, output_height=224, output_width=224, resize_side=256):
    """imagenet-standard: aspect-preserving resize to 256px smaller-side, then central-crop to 224px"""
    with eager_mode():
        h, w = image.shape[0], image.shape[1]
        scale = tf.cond(tf.less(h, w), lambda: resize_side / h, lambda: resize_side / w)
        resized_image = tf.compat.v1.image.resize_bilinear(tf.expand_dims(image, 0), [int(h * scale), int(w * scale)])
        cropped_image = tf.compat.v1.image.resize_with_crop_or_pad(resized_image, output_height, output_width)

        return tf.squeeze(cropped_image)
    
def preproc_yolo(image, output_height=640, output_width=640):
    preprocess = tv.transforms.Compose([
        tv.transforms.Resize((output_height, output_width)),
        tv.transforms.ToTensor(),
        #tv.transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    
    return preprocess(image)


images_path = "../data/coco/images/val2017" # "../data"
images_list = [img_name for img_name in os.listdir(images_path) if os.path.splitext(img_name)[1] == ".jpg"]
calib_dataset = np.zeros((len(images_list), 640, 640, 3))
for idx, img_name in enumerate(sorted(images_list)):
    img = Image.open(os.path.join(images_path, img_name)).convert('RGB')
    #img_preproc = preproc(img)
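    img_preproc = preproc_yolo(img)
    # ToTensor returns a CHW tensor in [0, 1]. Permute (not reshape) to HWC, and rescale to [0, 255]
    # because the model script below adds an on-chip normalization that divides by 255.
    calib_dataset[idx, :, :, :] = img_preproc.permute(1, 2, 0).numpy() * 255.0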

np.save("calib_set.npy", calib_dataset)

In [13]:
# Second, we will load our parsed HAR from the Parsing Tutorial

model_name = "yolo11n"
hailo_model_har_name = f"{model_name}_hailo_model.har"
assert os.path.isfile(hailo_model_har_name), "Please provide valid path for HAR file"
runner = ClientRunner(har=hailo_model_har_name)
# By default it uses the hw_arch that is saved on the HAR. For overriding, use the hw_arch flag.

In [17]:
# Now we will create a model script that tells the compiler to add a normalization layer at the beginning
# of the model (that is why we didn't normalize the calibration set;
# otherwise we would have to normalize it before using it)

# Batch size is 8 by default
alls =  """
normalization1 = normalization([0.0, 0.0, 0.0], [255.0, 255.0, 255.0])
change_output_activation(conv54, sigmoid)
change_output_activation(conv65, sigmoid)
change_output_activation(conv80, sigmoid)
nms_postprocess("./yolo11n_nms_config.json", meta_arch=yolov8, engine=cpu)
allocator_param(width_splitter_defuse=disabled)
 """

# Load the model script to ClientRunner so it will be considered on optimization
runner.load_model_script(alls)

# Call Optimize to perform the optimization process
runner.optimize(calib_dataset)

# Save the result state to a Quantized HAR file
quantized_model_har_path = f"{model_name}_quantized_model.har"
runner.save_har(quantized_model_har_path)

[info] Loading model script commands to yolo11n from string
[info] Starting Model Optimization
[info] Using default optimization level of 2
[info] Model received quantization params from the hn
[info] MatmulDecompose skipped
[info] Starting Mixed Precision
[info] Model Optimization Algorithm Mixed Precision is done (completion time is 00:00:00.42)
[info] LayerNorm Decomposition skipped
[info] Starting Statistics Collector
[info] Using dataset with 64 entries for calibration


Calibration: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 64/64 [00:24<00:00,  2.59entries/s]


[info] Model Optimization Algorithm Statistics Collector is done (completion time is 00:00:26.38)
[info] Starting Fix zp_comp Encoding
[info] Model Optimization Algorithm Fix zp_comp Encoding is done (completion time is 00:00:00.00)
[info] Starting Matmul Equalization
[info] Model Optimization Algorithm Matmul Equalization is done (completion time is 00:00:00.02)
[info] No shifts available for layer yolo11n/conv1/conv_op, using max shift instead. delta=3.0541
[info] No shifts available for layer yolo11n/conv1/conv_op, using max shift instead. delta=1.5271
[info] activation fitting started for yolo11n/reduce_sum_softmax1/act_op
[info] Finetune encoding skipped
[info] Bias Correction skipped
[info] Adaround skipped
[info] Starting Quantization-Aware Fine-Tuning
[info] Using dataset with 1024 entries for finetune
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
[info] Model Optimization Algorithm Quantization-Aware Fine-Tuning is done (completion time is 00:03:23.14)
[info] Starting Layer Noise An

Full Quant Analysis: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [01:53<00:00, 56.83s/iterations]


[info] Model Optimization Algorithm Layer Noise Analysis is done (completion time is 00:01:55.76)
[info] Model Optimization is done
[info] Saved HAR to: /local/workspace/hailo_virtualenv/lib/python3.10/site-packages/hailo_tutorials/notebooks/yolo11n_quantized_model.har


That concludes the quick tutorial.

## In-Depth Optimization Tutorial

The advanced optimization process (see the diagram in the user guide under: Building Models / Model optimization / Model Optimization Workflow) consists of the following steps:

1. Test the parsed `Native model` before any changes are made (still in floating-point precision), and check that the pre- and post-processing code works well with the start and end nodes provided. The Native model should match the results of the original model between the start_node_names and the end_node_names provided by the user during the Parsing stage.
2. Optional: Apply Model Modifications (like input Normalization layer, YUY2 to RGB conversion, changing output activations and others), using a `model script`.
3. Test the `FP Optimized model` (the model after floating point operations and modifications) to see that required results have been achieved.
    - Note: Remember to update the pre and post processing code to match the changes in the model. For example, if normalization has been added to the model, remove the normalization code from the pre-processing code, and feed un-normalized images to the model. If softmax has been added onto the outputs, remove the softmax from the post-processing code. Etc.
4. Now perform `Optimization` to the model, using a calibration set that has been prepared. The result is a `Quantized model`, that has some degradation compared to the pre-quantized model.
    - Note: The format of calibration set is the same as was used as inputs for the modified model. For example, if a normalization layer has been added to the model, the calibration set should not be normalized. If this layer has not been added yet, pre-process and normalize the images.
5. Test the quantized model using the same already-validated code for the pre and post processing.
    - If there is a degradation, it is due to the quantization process and not due to input/output formats, as they were already verified with the pre-quantized model.
6. To increase the accuracy of the quantized model, it is possible to optimize again using a `model script` to affect the optimization process.
    - Note: The most basic method is to raise the optimization_level; an example model script command is `model_optimization_flavor(optimization_level=4)`. The advanced method is to use the Layer Analysis Tool, presented in the next tutorial.
    - Note: If the accuracy is good, consider increasing the performance by using 4-bit weights. This is done using compression_level, an example model script command is `model_optimization_flavor(compression_level=2)`.
7. During the next tutorials, compilation and on-device inference, input and output values are expected to be similar to the quantized model's values.

The testing (whether on the Native, Modified or Quantized model) is performed using the `Emulator` feature, which is described in this tutorial.

The steps of the advanced optimization process are described in detail below.


--------
### Preliminary step: Create testing environment

Hailo offers an `Emulator` for testing the model in its different states. The emulator is implemented as a Tensorflow graph, and its results are the return value of `runner.infer(context, network_input)`. To get inference results, run this API within the context manager `runner.infer_context(inference_context)` where the inference context is one of: [`InferenceContext.SDK_NATIVE`, `InferenceContext.SDK_FP_OPTIMIZED`, `InferenceContext.SDK_QUANTIZED`]:
- `InferenceContext.SDK_NATIVE`: Testing method for Step 1 of the optimization process steps (`Native model`). Runs the model as is without any changes. Use it to make sure the model has been converted properly into Hailo's internal representation. It should yield the same results as the original model.
- `InferenceContext.SDK_FP_OPTIMIZED`: Testing method for Step 3 of the optimization process steps (`Modified model`). The modified model represents the Hailo model prior to quantization, and is the result of performing model modifications (e.g. normalizing/resizing inputs) and full precision optimizations (e.g. tiled squeeze & excite, equalization). As a result, inference results may vary slightly from the native results.
- `InferenceContext.SDK_QUANTIZED`: Testing method for Step 5 of the optimization process steps (`Quantized model`). This inference context emulates the hardware implementation, and is useful for measuring the overall accuracy and degradation of the quantized model. This measurement is performed against the original model over large datasets, prior to running inference on the actual Hailo device.


------------
### Preliminary Step: Create Pre and Post Processing Functions

In [2]:
# -----------------------------------------
# Pre processing (prepare the input images)
# -----------------------------------------
# def preproc(image, output_height=224, output_width=224, resize_side=256, normalize=False):
#     """imagenet-standard: aspect-preserving resize to 256px smaller-side, then central-crop to 224px"""
#     with eager_mode():
#         h, w = image.shape[0], image.shape[1]
#         scale = tf.cond(tf.less(h, w), lambda: resize_side / h, lambda: resize_side / w)
#         resized_image = tf.compat.v1.image.resize_bilinear(tf.expand_dims(image, 0), [int(h * scale), int(w * scale)])
#         cropped_image = tf.compat.v1.image.resize_with_crop_or_pad(resized_image, output_height, output_width)

#         if normalize:
#             # Default normalization parameters for ImageNet
#             cropped_image = (cropped_image - [123.675, 116.28, 103.53]) / [58.395, 57.12, 57.375]

#         return tf.squeeze(cropped_image)
    
import torchvision as tv


def preproc(image, output_height=640, output_width=640, normalize=False):
    """Resize a PIL image to the model input size; returns a CHW float tensor (in [0, 1] unless normalize=True)."""
    transforms = [
        tv.transforms.Resize((output_height, output_width)),
        tv.transforms.ToTensor(),
    ]
    if normalize:
        # ImageNet mean/std, expressed on the [0, 1] scale produced by ToTensor
        transforms.append(tv.transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]))
    return tv.transforms.Compose(transforms)(image)


# -----------------------------------------------------
# Post processing (what to do with the model's outputs)
# -----------------------------------------------------
def _get_imagenet_labels(json_path="../data/imagenet_names.json"):
    with open(json_path) as f:
        imagenet_names = json.load(f)
    imagenet_names = [imagenet_names[str(i)] for i in range(1001)]
    return imagenet_names[1:]


imagenet_labels = _get_imagenet_labels()


# def postproc(results):
#     labels = []
#     scores = []
#     results = [np.squeeze(result) for result in results]
#     for result in results:
#         top_ind = np.argmax(result)
#         cur_label = imagenet_labels[top_ind]
#         cur_score = 100 * result[top_ind]
#         labels.append(cur_label)
#         scores.append(cur_score)
#     return scores, labels

# Detection post-processing, adapted from the previous notebook.
# It relies on OpenCV and on a CLASSES list of class names, which must be defined before calling it.
import cv2


def postproc(output, conf_threshold=0.25, iou_threshold=0.45):
    # Scale factor from model-input coordinates back to image coordinates (image side / 640).
    # It can be 1 here, since the images are squares of the same size as the model input.
    scale = 1

    # Transpose the first output so that each row is one candidate: [cx, cy, w, h, class scores...]
    ort_outs = np.array([cv2.transpose(output[0])])
    rows = ort_outs.shape[1]

    boxes = []
    scores = []
    class_ids = []
    # Iterate through output to collect bounding boxes, confidence scores, and class IDs
    for i in range(rows):
        classes_scores = ort_outs[0][i][4:]
        (minScore, maxScore, minClassLoc, (x, maxClassIndex)) = cv2.minMaxLoc(classes_scores)
        if maxScore >= conf_threshold:
            # Convert [cx, cy, w, h] to [left, top, w, h]
            box = [
                ort_outs[0][i][0] - (0.5 * ort_outs[0][i][2]),
                ort_outs[0][i][1] - (0.5 * ort_outs[0][i][3]),
                ort_outs[0][i][2],
                ort_outs[0][i][3],
            ]
            boxes.append(box)
            scores.append(maxScore)
            class_ids.append(maxClassIndex)

    # Apply NMS (Non-maximum suppression); the last argument is the eta parameter of NMSBoxes
    ort_boxes = cv2.dnn.NMSBoxes(boxes, scores, conf_threshold, iou_threshold, 0.5)

    detections = []
    finalboxes = []
    # Iterate through NMS results to collect the final detections and their corner coordinates
    for i in range(len(ort_boxes)):
        index = ort_boxes[i]
        box = boxes[index]
        detection = {
            "class_id": class_ids[index],
            "class_name": CLASSES[class_ids[index]],
            "confidence": scores[index],
            "box": box,
            "scale": scale,
        }

        detections.append(detection)
        box_plot = [
            class_ids[index],
            scores[index],
            round(box[0] * scale),
            round(box[1] * scale),
            round((box[0] + box[2]) * scale),
            round((box[1] + box[3]) * scale),
        ]
        finalboxes.append(box_plot)
    return detections, finalboxes


# -------------
# Visualization
# -------------
def mynorm(data):
    return (data - np.min(data)) / (np.max(data) - np.min(data))


def visualize_results(
    images,
    first_scores,
    first_labels,
    second_scores=None,
    second_labels=None,
    first_title="Full Precision",
    second_title="Other",
):
    # Deal with input arguments
    assert (second_scores is None and second_labels is None) or (
        second_scores is not None and second_labels is not None
    ), "second_scores and second_labels must both be supplied, or both not be supplied"
    assert len(images) == len(first_scores) == len(first_labels), "lengths of inputs must be equal"

    show_only_first = second_scores is None
    if not show_only_first:
        assert len(images) == len(second_scores) == len(second_labels), "lengths of inputs must be equal"

    # Display
    for img_idx in range(len(images)):
        plt.figure()
        plt.imshow(mynorm(images[img_idx]))

        if not show_only_first:
            plt.title(
                f"{first_title}: top-1 class is {first_labels[img_idx]}. Confidence is {first_scores[img_idx]:.2f}%,\n"
                f"{second_title}: top-1 class is {second_labels[img_idx]}. Confidence is {second_scores[img_idx]:.2f}%",
            )
        else:
            plt.title(
                f"{first_title}: top-1 class is {first_labels[img_idx]}. Confidence is {first_scores[img_idx]:.2f}%",
            )

---------------
### Step 1: Test Native Model

Load the network to the ClientRunner from the saved Hailo Archive file:

In [3]:
model_name = "yolo11n"
hailo_model_har_name = f"{model_name}_hailo_model.har"
assert os.path.isfile(hailo_model_har_name), "Please provide valid path for HAR file"
runner = ClientRunner(har=hailo_model_har_name)
# By default it uses the hw_arch that is saved on the HAR. For overriding, use the hw_arch flag.

In [None]:
import torch
import torchvision as tv

images_path = "../data/coco/images/val2017" # "../data"
images_list = [img_name for img_name in os.listdir(images_path) if os.path.splitext(img_name)[1] == ".jpg"]

# Create an un-normalized dataset for visualization
image_dataset = np.zeros((len(images_list), 640, 640, 3))
# Create a normalized dataset to feed into the Native emulator
image_dataset_normalized = np.zeros((len(images_list), 640, 640, 3))
for idx, img_name in enumerate(sorted(images_list)):
    img = Image.open(os.path.join(images_path, img_name)).convert('RGB')
    img_preproc = preproc(img)
    # ToTensor returns a CHW tensor in [0, 1]. Permute (not reshape) to HWC, and rescale to [0, 255]
    # so the images match the [0, 255] scale of the normalization layer added in Steps 2,3.
    image_dataset[idx, :, :, :] = img_preproc.permute(1, 2, 0).numpy() * 255.0

print('Preprocessing done\nRunning preproc with normalization...')
# Run again for the normalized data
for idx, img_name in enumerate(sorted(images_list)):
    img = Image.open(os.path.join(images_path, img_name)).convert('RGB')
    img_preproc_norm = preproc(img, normalize=True)
    # Permute CHW to HWC; a plain reshape would scramble the pixel layout
    image_dataset_normalized[idx, :, :, :] = img_preproc_norm.permute(1, 2, 0).numpy()

Preprocessing done
Running preproc with normalization...


Now call the `Native` emulator:

In [7]:
# Notice that we use the normalized images, because normalization is not in the model
with runner.infer_context(InferenceContext.SDK_NATIVE) as ctx:
    native_res = runner.infer(ctx, image_dataset_normalized[:IMAGES_TO_VISUALIZE, :, :, :])

native_scores, native_labels = postproc(native_res)
visualize_results(image_dataset[:IMAGES_TO_VISUALIZE, :, :, :], native_scores, native_labels)

[info] Using 1 GPU for inference


Inference: 8entries [00:11,  1.39s/entries]


IndexError: list index out of range
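
The `postproc` function defined above returns `(detections, finalboxes)` for a single image, while `visualize_results` expects one score and one label per image. One possible way to bridge the two, shown here purely as a sketch, is to reduce each image's detections to the most confident one. `top_detection` is a hypothetical helper, and the dictionary keys are the ones built in `postproc`.

```python
def top_detection(detections, default_label="no detection"):
    """Return (score_percent, label) of the most confident detection, or (0.0, default_label)."""
    if not detections:
        return 0.0, default_label
    best = max(detections, key=lambda det: det["confidence"])
    return 100.0 * float(best["confidence"]), best["class_name"]


# Example with hand-written detections in the same format that postproc builds
example = [
    {"class_id": 0, "class_name": "person", "confidence": 0.91, "box": [10, 20, 50, 80], "scale": 1},
    {"class_id": 16, "class_name": "dog", "confidence": 0.62, "box": [60, 40, 30, 30], "scale": 1},
]
score, label = top_detection(example)
print(f"{label}: {score:.1f}%")
```

Applied once per image, the resulting lists of scores and labels have the form that `visualize_results` expects.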

### Steps 2,3: Apply Model Modifications, and Test Modified Model

The `Model Script` is a text file that includes `model script commands`, affecting the stages of the compiler.

In the next steps the following will be performed:
- Create a model script for the Optimization process that also includes the model modifications.
- Load the model script (it won't be applied yet).
- Call `runner.optimize_full_precision()` to apply the model modifications (alternatively, call `optimize()`, which also applies the model modifications).
- Then call the SDK_FP_OPTIMIZED emulation context.
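
To see why the normalization can be dropped from the pre-processing once it is part of the model, the sketch below checks numerically that normalizing a [0, 255] image with the mean/std values of the model script in the next cell gives the same result as the torchvision-style normalization of the same image scaled to [0, 1]. It is plain NumPy and does not call the Dataflow Compiler; it assumes the normalization layer computes `(x - mean) / std` per channel, and the random image is a stand-in for real data.

```python
import numpy as np

# Per-channel statistics on the [0, 255] scale (as passed to the normalization command)
mean_255 = np.array([123.675, 116.28, 103.53])
std_255 = np.array([58.395, 57.12, 57.375])

# The same statistics on the [0, 1] scale (as used by tv.transforms.Normalize after ToTensor)
mean_unit = np.array([0.485, 0.456, 0.406])
std_unit = np.array([0.229, 0.224, 0.225])

rng = np.random.default_rng(0)
image_255 = rng.integers(0, 256, size=(4, 4, 3)).astype(np.float64)  # HWC image, values in [0, 255]

# Assumed behavior of a normalization layer placed at the model input
normalized_in_model = (image_255 - mean_255) / std_255
# Normalization done in the pre-processing code instead
normalized_in_preproc = (image_255 / 255.0 - mean_unit) / std_unit

print(np.allclose(normalized_in_model, normalized_in_preproc))
```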

In [None]:
model_script_lines = [
    # Add normalization layer with mean [123.675, 116.28, 103.53] and std [58.395, 57.12, 57.375]
    "normalization1 = normalization([123.675, 116.28, 103.53], [58.395, 57.12, 57.375])\n",
    # For multiple input nodes:
    # {normalization_layer_name_1} = normalization([list of means per channel], [list of stds per channel], {input_layer_name_1_from_hn})\n',
    # {normalization_layer_name_2} = normalization([list of means per channel], [list of stds per channel], {input_layer_name_2_from_hn})\n',
    # ...
]

# Load the model script to ClientRunner so it will be considered on optimization
runner.load_model_script("".join(model_script_lines))
runner.optimize_full_precision()

In [None]:
# Notice that we use the original images, because normalization is IN the model
with runner.infer_context(InferenceContext.SDK_FP_OPTIMIZED) as ctx:
    modified_res = runner.infer(ctx, image_dataset[:IMAGES_TO_VISUALIZE, :, :, :])

modified_scores, modified_labels = postproc(modified_res)

visualize_results(
    image_dataset[:IMAGES_TO_VISUALIZE, :, :, :],
    native_scores,
    native_labels,
    modified_scores,
    modified_labels,
    second_title="FP Modified",
)

### Steps 4,5: Optimize the Model and Test its Accuracy

1. We will create a calibration dataset (the same as the input to the modified model).
2. Then we will call Optimize.
3. Then we will test its accuracy vs. the modified model. Please note that the quantized emulator is not bit-exact with the Hailo hardware, but provides a good and fast approximation.
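
One simple, model-agnostic way to put a number on the degradation measured in the last step is the signal-to-noise ratio between the full-precision and the quantized outputs. The sketch below is an assumption about how such a comparison might be done by hand; `snr_db` is a hypothetical helper, and the arrays are synthetic stand-ins for the emulator results.

```python
import numpy as np


def snr_db(reference, test):
    """Signal-to-noise ratio in dB of `test` relative to `reference` (higher means closer)."""
    reference = np.asarray(reference, dtype=np.float64)
    test = np.asarray(test, dtype=np.float64)
    noise_power = np.mean((reference - test) ** 2)
    signal_power = np.mean(reference ** 2)
    if noise_power == 0:
        return float("inf")
    return 10.0 * np.log10(signal_power / noise_power)


# Synthetic example: emulate uniform 8-bit quantization of a floating-point output tensor
rng = np.random.default_rng(0)
fp_output = rng.normal(size=(5, 1000))
step = (fp_output.max() - fp_output.min()) / 255.0
quantized_output = np.round(fp_output / step) * step

print(f"SNR: {snr_db(fp_output, quantized_output):.1f} dB")
```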

In [None]:
# The original images are being used, just as the input to the SDK_FP_OPTIMIZED emulator
calib_dataset = image_dataset

# For calling Optimize, use the short version: runner.optimize(calib_dataset)
# A more general approach is being used here that works also with multiple input nodes.
# The calibration dataset could also be a dictionary with the format:
# {input_layer_name_1_from_hn: layer_1_calib_dataset, input_layer_name_2_from_hn: layer_2_calib_dataset}
hn_layers = runner.get_hn_dict()["layers"]
input_layers = [layer for layer in hn_layers if hn_layers[layer]["type"] == "input_layer"]
print("Input layers are: ")
print(input_layers)  # See available input layer names
# In our case there is only one input layer; its name is taken from the list above
calib_dataset_dict = {input_layers[0]: calib_dataset}
runner.optimize(calib_dataset_dict)

In [None]:
# Notice that we use the original images, because normalization is in the model
with runner.infer_context(InferenceContext.SDK_QUANTIZED) as ctx:
    quantized_res = runner.infer(ctx, image_dataset[:IMAGES_TO_VISUALIZE, :, :, :])

quantized_scores, quantized_labels = postproc(quantized_res)

visualize_results(
    image_dataset[:IMAGES_TO_VISUALIZE, :, :, :],
    modified_scores,
    modified_labels,
    quantized_scores,
    quantized_labels,
    first_title="FP Modified",
    second_title="Quantized",
)

In [None]:
# Let's save the runner's state to a Quantized HAR
quantized_model_har_path = f"{model_name}_quantized_model.har"
runner.save_har(quantized_model_har_path)

### Multiple GPU Inference Examples
*This demo requires multiple GPUs.*

Further information on utilizing multiple GPUs is available in the `Dataflow Compiler user guide / Model optimization` section.



In [None]:
num_gpus = len(tf.config.list_physical_devices("GPU"))
if num_gpus > 1:
    with runner.infer_context(InferenceContext.SDK_NATIVE, gpu_policy="model_parallelization") as ctx:
        native_res = runner.infer(ctx, image_dataset_normalized)

    # The quantized model includes the normalization layer, so it receives the un-normalized images
    with runner.infer_context(InferenceContext.SDK_QUANTIZED, gpu_policy="data_parallelization") as ctx:
        quantized_res = runner.infer(ctx, image_dataset)
else:
    print(f"To run this cell at least two GPUs are needed, there are only {num_gpus} available")

### Step 6: How to Raise Accuracy 
To increase the accuracy of the quantized model, optimize again using a model script to affect the optimization process.

There are several tools that can be used.

* Verify that a GPU is available and that the calibration set contains at least 1024 images
* Raise the optimization_level value using the model_optimization_flavor command. If it fails due to high GPU memory usage, try lowering the batch_size as described in the last example
* Decrease the compression_level value using the model_optimization_flavor command (default is 0, lowest option)
* Set the output layer(s) to use 16-bit accuracy using the command quantization_param(output_layer_name, precision_mode=a16_w16). Note that the DFC will set 16-bit output automatically for small enough outputs.
* Use the Layer Noise Analysis tools to find layers with low SNR, and affect their quantization using weight or activation clipping (see the next tutorial)
* Experiment with the FineTune parameters (refer to the user guide for more details)

For more information refer to the user guide under: Building Models / Model optimization / Model Optimization Workflow / Debugging accuracy.
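
The model script commands mentioned in the list above can be collected into one script. The sketch below only assembles the text: `build_accuracy_script` is a hypothetical helper, the command strings are the ones quoted in this tutorial, and `output_layer_name` is a placeholder that must be replaced by a real layer name from the model. Passing both levels in a single `model_optimization_flavor` call is an assumption that should be checked against the user guide.

```python
def build_accuracy_script(optimization_level=None, compression_level=None, a16_w16_layers=()):
    """Assemble model script text from the commands quoted in this tutorial (hypothetical helper)."""
    lines = []
    flavor_args = []
    if optimization_level is not None:
        flavor_args.append(f"optimization_level={optimization_level}")
    if compression_level is not None:
        flavor_args.append(f"compression_level={compression_level}")
    if flavor_args:
        # Assumption: both arguments may be passed in a single command
        lines.append(f"model_optimization_flavor({', '.join(flavor_args)})")
    for layer_name in a16_w16_layers:
        lines.append(f"quantization_param({layer_name}, precision_mode=a16_w16)")
    return "".join(line + "\n" for line in lines)


script = build_accuracy_script(optimization_level=4, compression_level=0, a16_w16_layers=["output_layer_name"])
print(script)
```

The resulting text would then be passed to `runner.load_model_script(script)` before calling `runner.optimize(...)` again.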

This completes the in-depth optimization tutorial.

## Advanced Model Modifications Tutorial
### Adding on-chip input format conversion through model script commands 
This block applies model modification commands using a model script. A [YUY2](https://en.wikipedia.org/wiki/YUV) -> [YUV](https://en.wikipedia.org/wiki/YUV) -> RGB conversion will be added.

Unlike the normalization layer, which can be simulated with the SDK_FP_OPTIMIZED and SDK_QUANTIZED emulators, not all format conversions are supported in the emulator (for more information see the `Dataflow Compiler user guide / Model optimization` section).
Every conversion that runs in the emulator affects the calibration set, and the user should supply the set accordingly.
For example, after adding a YUV -> RGB format conversion layer, the calibration set is expected to be in YUV format.
However, for some conversions the user may choose to skip the conversion in emulation and use the original calibration set instead.
For instance, this tutorial uses the YUY2 -> YUV layer without emulation, so that the emulator input and the calibration dataset remain in YUV format.
In that case the format conversion layer is relevant only when running the compiled .hef file on the device.

Note: The NV21 -> YUV conversion is not supported in emulation.

The steps are:

1) Initialize Client Runner
2) Load YUV dataset
3) Load model script with the relevant commands
4) Using the optimize() API, the commands are applied and the model is quantized
5) Usage:
  * To create input conversion after a specific layer: yuv_to_rgb_layer = input_conversion(input_layer1, yuv_to_rgb)
  * To include the conversion in the optimization process: yuv_to_rgb_layer = input_conversion(input_layer1, yuv_to_rgb, emulator_support=True)
  * To create input conversion after all input layers: net_scope1/yuv2rgb1, net_scope2/yuv2rgb2 = input_conversion(yuv_to_rgb)

In [None]:
# Let's load the original parsed model again
model_name = "resnet_v1_18"
hailo_model_har_name = f"{model_name}_hailo_model.har"
assert os.path.isfile(hailo_model_har_name), "Please provide valid path for HAR file"
runner = ClientRunner(har=hailo_model_har_name)

# We are using a pre-made YUV calibration set
calib_dataset_yuv = np.load("../model_modifications/calib_dataset_yuv.npz")

# Now we're adding yuy2_to_yuv conversion before the yuv_to_rgb and a normalization layer.
# The order of the layers is determined by the order of the commands in the model script:
# First we add normalization to the original input layer -> the input to the network is now normalization1
# Then we add yuv_to_rgb layer, so the order will be: yuv_to_rgb1->normalization1->original_network
# Lastly, we add yuy2_to_yuv layer, so the order will be: yuy2_to_yuv1->yuv_to_rgb1->normalization1->original_network
model_script_commands = [
    "normalization1 = normalization([123.675, 116.28, 103.53], [58.395, 57.12, 57.375])\n",
    "yuv_to_rgb1 = input_conversion(yuv_to_rgb)\n",
    "yuy2_to_yuv1 = input_conversion(input_layer1, yuy2_to_hailo_yuv)\n",
]
runner.load_model_script("".join(model_script_commands))

# Notice that we don't have to call runner.optimize_full_precision(); it's only an intermediate step
# that makes it possible to use the SDK_FP_OPTIMIZED emulator before Optimization.
runner.optimize(calib_dataset_yuv["yuv_dataset"])

modified_model_har_name = f"{model_name}_modified.har"
runner.save_har(modified_model_har_name)

!hailo visualizer {hailo_model_har_name} --no-browser
SVG("resnet_v1_18.svg")

### Adding On-chip Input Resize Through Model Script Commands
This block will apply on-chip bilinear image resize at the beginning of the network through model script commands:

* Create a bigger (640x480) calibration set out of the Imagenet dataset
* Initialize Client Runner
* Load the new calibration set
* Load the model script with the resize command
* Using the optimize() API, the command is applied and the model is quantized

In [None]:
images_path = "../data"
images_list = [img_name for img_name in os.listdir(images_path) if os.path.splitext(img_name)[1] == ".jpg"]

idx_to_visualize = None
images_list = images_list[:64]
calib_dataset_new = np.zeros((len(images_list), 480, 640, 3))
for idx, img_name in enumerate(images_list):
    # Convert to RGB so that grayscale images also yield 3 channels
    img = Image.open(os.path.join(images_path, img_name)).convert("RGB")
    resized_image = np.array(img.resize((640, 480), Image.Resampling.BILINEAR))
    calib_dataset_new[idx, :, :, :] = resized_image
    # find an image that will be nice to display
    if idx_to_visualize is None and img.size[0] != 640:
        idx_to_visualize = idx
        img_to_show = img


np.save("calib_set_480_640.npy", calib_dataset_new)
plt.imshow(img_to_show)
plt.title("Original image")
plt.show()
plt.imshow(np.array(calib_dataset_new[idx_to_visualize, :, :, :], np.uint8))
plt.title("Resized image")
plt.show()

In [None]:
model_name = "resnet_v1_18"
hailo_model_har_name = f"{model_name}_hailo_model.har"
assert os.path.isfile(hailo_model_har_name), "Please provide valid path for HAR file"
runner = ClientRunner(har=hailo_model_har_name)

calib_dataset_large = np.load("calib_set_480_640.npy")

# Add a bilinear resize from 480x640 to the network's input size - in this case, 224x224.
# The order of the layers is determined by the order of the commands in the model script:
# First we add normalization to the original input layer -> the input to the network is now normalization1
# Then we add resize layer, so the order will be: resize_input1->normalization1->original_network
model_script_commands = [
    "normalization1 = normalization([123.675, 116.28, 103.53], [58.395, 57.12, 57.375])\n",
    "resize_input1 = resize(resize_shapes=[480,640])\n",
]

runner.load_model_script("".join(model_script_commands))
calib_dataset_dict = {"resnet_v1_18/input_layer1": calib_dataset_large}  # In our case there is only one input layer
runner.optimize(calib_dataset_dict)

modified_model_har_name = f"{model_name}_resized.har"
runner.save_har(modified_model_har_name)

!hailo visualizer {modified_model_har_name} --no-browser
SVG("resnet_v1_18.svg")
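
As a rough illustration of what the `normalization` command in the model script computes, the sketch below applies a per-channel `(x - mean) / std` to one RGB pixel, using the mean and standard-deviation values from the script above. It assumes that this is the operation performed by the added layer; `normalize_pixel` is a hypothetical helper, not part of the Hailo SDK.

```python
# Per-channel values taken from the normalization command in the model script
MEAN = [123.675, 116.28, 103.53]
STD = [58.395, 57.12, 57.375]


def normalize_pixel(pixel, mean=MEAN, std=STD):
    """Apply (x - mean) / std to each channel of one RGB pixel (assumed behavior)."""
    return [(value - m) / s for value, m, s in zip(pixel, mean, std)]


# A pixel equal to the per-channel mean maps to zeros
print(normalize_pixel([123.675, 116.28, 103.53]))  # [0.0, 0.0, 0.0]
```

Because the resize layer is placed before the normalization layer, the calibration images fed to `optimize()` stay in the raw 0-255 pixel range at the original 480x640 resolution.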

### Adding Non-Maximum Suppression (NMS) Layer Through Model Script Commands
This block adds an NMS layer at the end of the network through the `nms_postprocess` model script command. The command accepts the following arguments:

* Config json: an external JSON file that allows changing the NMS parameters (can be skipped to use the default configuration).
* Meta architecture: which meta architecture to use (for example, `yolov5`, `ssd`, etc.). In this example, `yolov5` is used.
* Engine: defines the inference device for running the NMS: `nn_core`, `cpu` or `auto` (this example uses `cpu`).

Usage:

* Initialize Client Runner
* Translate a YOLOv5 model
* Load the model script with the NMS command
* Use the `optimize_full_precision()` API to apply the command (note that the `optimize()` API can also be used)
* Display the inference results
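
To make the role of the score and IoU thresholds more concrete, here is a minimal pure-Python sketch of score filtering followed by greedy NMS for a single class. It is an illustration under simplifying assumptions (boxes given as `(y_min, x_min, y_max, x_max)` tuples, hypothetical helper names `iou` and `greedy_nms`), not the implementation used by the `nms_postprocess` layer.

```python
def iou(a, b):
    """Intersection over union of two (y_min, x_min, y_max, x_max) boxes."""
    inter_h = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_w = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_h * inter_w
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, scores_th=0.2, iou_th=0.4):
    """Return the indices of the boxes that are kept, highest score first."""
    # Drop low-score candidates, then visit the rest from best to worst
    order = sorted(
        (i for i, s in enumerate(scores) if s >= scores_th),
        key=lambda i: scores[i],
        reverse=True,
    )
    kept = []
    for i in order:
        # Suppress a box if it overlaps an already-kept box by more than iou_th
        if all(iou(boxes[i], boxes[k]) <= iou_th for k in kept):
            kept.append(i)
    return kept


boxes = [(0.0, 0.0, 1.0, 1.0), (0.0, 0.1, 1.0, 1.1), (2.0, 2.0, 3.0, 3.0), (0.0, 0.0, 1.0, 1.0)]
scores = [0.9, 0.8, 0.7, 0.1]
print(greedy_nms(boxes, scores))  # [0, 2]
```

In this example, box 3 is removed by the score threshold, and box 1 is suppressed because it overlaps the higher-scoring box 0 by more than the IoU threshold. Raising the score threshold removes more low-confidence boxes, while raising the IoU threshold lets more overlapping boxes survive.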


In [None]:
model_name = "yolov5s"
onnx_path = f"../models/{model_name}.onnx"
assert os.path.isfile(onnx_path), "Please provide valid path for ONNX file"

# Initialize a new client runner
runner = ClientRunner(hw_arch="hailo8")
# Any other hw_arch can be used as well.

# Translate YOLO model from ONNX
runner.translate_onnx_model(onnx_path, end_node_names=["Conv_298", "Conv_248", "Conv_198"])
# Note: NMS will be detected automatically, with a message that contains:
#   - 'original layer name': {'w': [WIDTHS], 'h': [HEIGHTS], 'stride': STRIDE, 'encoded_layer': TRANSLATED_LAYER_NAME}
# Use nms_postprocess(meta_arch=yolov5) to add the NMS.

# Add model script with NMS layer at the network's output.
model_script_commands = [
    "normalization1 = normalization([0.0, 0.0, 0.0], [255.0, 255.0, 255.0])\n",
    "resize_input1 = resize(resize_shapes=[480,640])\n",
    "nms_postprocess(meta_arch=yolov5, engine=cpu, nms_scores_th=0.2, nms_iou_th=0.4)\n",
]
# Note: A score threshold of 0.0 means no filtering, and 1.0 means maximal filtering. The IoU threshold works the opposite way: 1.0 means boxes are filtered only if they are equal, and 0.0 means boxes are filtered even with minimal overlap.
runner.load_model_script("".join(model_script_commands))

# Apply model script changes
runner.optimize_full_precision()

# Infer an image with the Hailo Emulator
with runner.infer_context(InferenceContext.SDK_FP_OPTIMIZED) as ctx:
    nms_output = runner.infer(ctx, calib_dataset_new[:16, ...])
HEIGHT = 480
WIDTH = 640
# For each image
for i in range(16):
    found_any = False
    min_score = None
    max_score = None
    # Go over all classes
    for class_index in range(nms_output.shape[1]):
        score, box = nms_output[i][class_index, 4, :], nms_output[i][class_index, 0:4, :]
        # Go over all detections
        for detection_idx in range(box.shape[1]):
            cur_score = score[detection_idx]
            # Discard null detections (the output tensor is always padded to MAX_DETECTIONS on the emulator interface).
            # Note: With the HailoRT APIs (used in the Inference Tutorial and with the C++ APIs), the default is a list per class. For more information, look for NMS in the HailoRT user guide.
            if cur_score == 0:
                continue

            # Plotting code
            if not found_any:
                found_any = True
                fig, ax = plt.subplots()
                ax.imshow(Image.fromarray(np.array(calib_dataset_new[i], np.uint8)))
            if min_score is None or cur_score < min_score:
                min_score = cur_score
            if max_score is None or cur_score > max_score:
                max_score = cur_score
            # Box coordinates are normalized, so scale them to pixels
            y_min, x_min = box[0, detection_idx] * HEIGHT, box[1, detection_idx] * WIDTH
            y_max, x_max = box[2, detection_idx] * HEIGHT, box[3, detection_idx] * WIDTH
            # Rectangle is anchored at its (x_min, y_min) corner, not at its center
            top_left, width, height = (x_min, y_min), x_max - x_min, y_max - y_min
            # draw the box on the input image
            rect = patches.Rectangle(top_left, width, height, linewidth=1, edgecolor="r", facecolor="none")
            ax.add_patch(rect)

    if found_any:
        plt.title(f"Plot of high score boxes. Scores between {min_score:.2f} and {max_score:.2f}")
        plt.show()

## Advanced Optimization - Compression and Optimization Levels

For aggressive quantization (compressing a significant share of the weights to 4 bits), a higher optimization level is needed to obtain good results.

For quick iterations, it is recommended to start with the default settings of the model optimizer (optimization_level=2, compression_level=1). When moving to production, however, it is recommended to work at the highest optimization level to achieve optimal accuracy. With regard to compression, users should increase the level when the overall throughput or latency of the model is not good enough.  
Note that increasing compression has a negligible effect on power consumption, so the motivation for working with a higher compression level is mainly FPS.


Here the compression level is set to 4 (which means ~80% of the weights are quantized to 4 bits) using the `compression_level` parameter in a model script, and the model optimization is run again. Using 4-bit weights might reduce the model's accuracy, but it helps to reduce the model's memory footprint. In this example, the reliability of some predictions decreases after several layers are changed to 4-bit weights; it improves later, once a higher `optimization_level` is applied.
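
As a back-of-the-envelope illustration of the memory effect, the sketch below estimates the weight footprint when a given fraction of the weights is stored in 4 bits and the rest in 8 bits. It assumes the ratio is counted per weight and ignores biases and other per-layer parameters; `weight_memory_ratio` is a hypothetical helper, not part of the Hailo SDK.

```python
def weight_memory_ratio(ratio_4bit, base_bits=8, compressed_bits=4):
    """Weight memory relative to a model whose weights are all stored in base_bits."""
    avg_bits = ratio_4bit * compressed_bits + (1.0 - ratio_4bit) * base_bits
    return avg_bits / base_bits


for ratio in (0.0, 0.4, 0.8):
    print(f"{ratio:.0%} 4-bit weights -> {weight_memory_ratio(ratio):.0%} of the 8-bit weight memory")
```

Under these assumptions, an 80% 4-bit ratio leaves about 60% of the all-8-bit weight memory, that is, a saving of roughly 40%.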

In [None]:
alls_lines = [
    "normalization1 = normalization([123.675, 116.28, 103.53], [58.395, 57.12, 57.375])\n",
    # Batch size is 8 by default; 2 was used for stability on PCs with low amount of RAM / VRAM
    "model_optimization_flavor(optimization_level=0, compression_level=4, batch_size=2)\n",
    # The following line is needed because resnet_v1_18 is a very small model, so the compression_level is always reverted to 0.
    # To force compression on small models, use the following line (compression_level=4 is equivalent to an 80% 4-bit ratio):
    "model_optimization_config(compression_params, auto_4bit_weights_ratio=0.8)\n",
    # The compression can be verified through the [info] messages: "Assigning 4bit weight to layer .."
]
# -- About 80% of the weights are stored in 4 bits instead of 8

runner = ClientRunner(har=hailo_model_har_name)

runner.load_model_script("".join(alls_lines))
runner.optimize(calib_dataset)

In [None]:
images = calib_dataset[:IMAGES_TO_VISUALIZE, :, :, :]
with runner.infer_context(InferenceContext.SDK_FP_OPTIMIZED) as ctx:
    modified_res = runner.infer(ctx, images)
with runner.infer_context(InferenceContext.SDK_QUANTIZED) as ctx:
    quantized_res = runner.infer(ctx, images)

modified_scores, modified_labels = postproc(modified_res)
quantized_scores, quantized_labels = postproc(quantized_res)

visualize_results(
    image_dataset[:IMAGES_TO_VISUALIZE, :, :, :],
    modified_scores,
    modified_labels,
    quantized_scores,
    quantized_labels,
    first_title="FP Modified",
    second_title="Quantized",
)

Now repeat the same process with a higher optimization level (for full information, see the `Dataflow Compiler user guide / Model optimization` section):

In [None]:
images = calib_dataset[:IMAGES_TO_VISUALIZE, :, :, :]

alls_lines = [
    "normalization1 = normalization([123.675, 116.28, 103.53], [58.395, 57.12, 57.375])\n",
    # Batch size is 8 by default; 2 was used for stability on PCs with low amount of RAM / VRAM
    "model_optimization_flavor(optimization_level=2, compression_level=4, batch_size=2)\n",
    # The following line is needed because resnet_v1_18 is a very small model, so the compression_level is always reverted to 0.
    # To force compression on small models, use the following line (compression_level=4 is equivalent to an 80% 4-bit ratio):
    "model_optimization_config(compression_params, auto_4bit_weights_ratio=0.8)\n",
    # The compression can be verified through the [info] messages: "Assigning 4bit weight to layer .."
]
# -- About 80% of the weights are stored in 4 bits instead of 8

runner = ClientRunner(har=hailo_model_har_name)
runner.load_model_script("".join(alls_lines))
runner.optimize(calib_dataset)

with runner.infer_context(InferenceContext.SDK_FP_OPTIMIZED) as ctx:
    modified_res = runner.infer(ctx, images)
with runner.infer_context(InferenceContext.SDK_QUANTIZED) as ctx:
    quantized_res = runner.infer(ctx, images)

modified_scores, modified_labels = postproc(modified_res)
quantized_scores_new, quantized_labels_new = postproc(quantized_res)

visualize_results(
    image_dataset[:IMAGES_TO_VISUALIZE, :, :, :],
    modified_scores,
    modified_labels,
    quantized_scores_new,
    quantized_labels_new,
    first_title="FP Modified",
    second_title="Quantized",
)

In [None]:
print(
    f"Full precision predictions:                        {modified_labels}\n"
    f"Quantized predictions (with optimization_level=2): {quantized_labels_new} "
    f"({sum(np.array(modified_labels) == np.array(quantized_labels_new))}/{len(modified_labels)})\n"
    f"Quantized predictions (with optimization_level=0): {quantized_labels} "
    f"({sum(np.array(modified_labels) == np.array(quantized_labels))}/{len(modified_labels)})",
)

Finally, save the optimized model to a Hailo Archive file:

In [None]:
runner.save_har(quantized_model_har_path)