# EfficientNet Lite - Deployment with MERA software stack

Model details: [EfficientNet paper](https://arxiv.org/abs/1905.11946).

|**Model** | **params** | **MAdds** | **FP32 accuracy** | **FP32 CPU latency** | **FP32 GPU latency** | **FP16 GPU latency** | **INT8 accuracy** | **INT8 CPU latency** | **INT8 TPU latency** |
|------|-----|-------|-------|-------|-------|-------|-------|-------|-------|
|efficientnet-lite0 [ckpt](https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/lite/efficientnet-lite0.tar.gz) | 4.7M | 407M |  75.1% |  12ms | 9.0ms | 6.0ms  | 74.4% |  6.5ms | 3.8ms |
|efficientnet-lite1 [ckpt](https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/lite/efficientnet-lite1.tar.gz) | 5.4M | 631M |  76.7% |  18ms | 12ms | 8.0ms  |  75.9% | 9.1ms | 5.4ms |
|efficientnet-lite2 [ckpt](https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/lite/efficientnet-lite2.tar.gz) | 6.1M | 899M |  77.6% |  26ms | 16ms | 10ms | 77.0% | 12ms | 7.9ms |
|efficientnet-lite3 [ckpt](https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/lite/efficientnet-lite3.tar.gz) | 8.2M | 1.44B |  79.8% |  41ms | 23ms | 14ms  | 79.0% | 18ms | 9.7ms |
|efficientnet-lite4 [ckpt](https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/lite/efficientnet-lite4.tar.gz) |13.0M | 2.64B |  81.5% |  76ms | 36ms | 21ms  | 80.2% | 30ms | - |

* CPU, GPU, and TPU latencies are measured on a Pixel 4 with batch size 1 and 4 CPU threads. FP16 GPU latency is measured with the default settings, while FP32 GPU latency is measured with the additional option `--gpu_precision_loss_allowed=false`.
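
To make the trade-off in the table concrete, the sketch below computes the INT8-over-FP32 CPU speedup and the accuracy drop for each variant. The numbers are copied from the table; the helper name `int8_tradeoff` is only illustrative.

```python
# (FP32 CPU ms, INT8 CPU ms, FP32 accuracy %, INT8 accuracy %), copied from the table
table = {
    "efficientnet-lite0": (12.0, 6.5, 75.1, 74.4),
    "efficientnet-lite1": (18.0, 9.1, 76.7, 75.9),
    "efficientnet-lite2": (26.0, 12.0, 77.6, 77.0),
    "efficientnet-lite3": (41.0, 18.0, 79.8, 79.0),
    "efficientnet-lite4": (76.0, 30.0, 81.5, 80.2),
}


def int8_tradeoff(fp32_ms, int8_ms, fp32_acc, int8_acc):
    """Return (CPU speedup factor, accuracy drop in percentage points)."""
    return fp32_ms / int8_ms, fp32_acc - int8_acc


for name, row in table.items():
    speedup, drop = int8_tradeoff(*row)
    print(f"{name}: {speedup:.2f}x faster on CPU, {drop:.1f} points lower accuracy")
```

These figures describe the Pixel 4 measurements only; latency on other hardware has to be measured separately.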

Original repository: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet/lite

## Deployment guide
### Download the models:

In [1]:
%%capture
!wget https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/lite/efficientnet-lite1.tar.gz
!tar xvzf efficientnet-lite1.tar.gz
!rm efficientnet-lite1.tar.gz
!cp efficientnet-lite1/efficientnet-lite1-int8.tflite effnet-lite1.tflite
!rm -rf efficientnet-lite1/

!wget https://storage.googleapis.com/cloud-tpu-checkpoints/efficientnet/lite/efficientnet-lite4.tar.gz
!tar xvzf efficientnet-lite4.tar.gz
!rm efficientnet-lite4.tar.gz
!cp efficientnet-lite4/efficientnet-lite4-int8.tflite effnet-lite4.tflite
!rm -rf efficientnet-lite4/
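
Where `wget` and `tar` are not available, the extraction step can be approximated in plain Python. This is a minimal sketch, assuming the archive has already been downloaded and that the INT8 model is stored under `<variant>/<variant>-int8.tflite`, as the `cp` commands above suggest. The function name `extract_int8_model` is hypothetical.

```python
import shutil
import tarfile
from pathlib import Path


def extract_int8_model(archive_path, variant, output_path):
    """Copy <variant>/<variant>-int8.tflite from a .tar.gz archive to output_path."""
    # Assumed member layout, inferred from the shell commands in this guide
    member_name = f"{variant}/{variant}-int8.tflite"
    with tarfile.open(archive_path, "r:gz") as archive:
        member = archive.extractfile(member_name)  # raises KeyError if missing
        if member is None:
            raise FileNotFoundError(f"{member_name} is not a regular file")
        with open(output_path, "wb") as out:
            shutil.copyfileobj(member, out)
    return Path(output_path)
```

For example, `extract_int8_model("efficientnet-lite1.tar.gz", "efficientnet-lite1", "effnet-lite1.tflite")` would produce the same file name as the first block of shell commands.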

### Basic imports

In [2]:
import numpy as np
import tensorflow as tf

import mera
from mera import Target
from mera import Platform

### Load image helper

In [3]:
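def load_image(image_path, input_size):
    image = tf.io.read_file(image_path)
    # decode_image detects the file format (the example input is a PNG);
    # channels=3 drops any alpha channel so the model receives RGB
    image = tf.io.decode_image(image, channels=3, expand_animations=False)
    input_image = tf.expand_dims(image, axis=0)
    # Resize to a square input, preserving aspect ratio and padding the rest
    input_image = tf.image.resize_with_pad(input_image, input_size, input_size)
    input_image = tf.cast(input_image, tf.uint8)
    return input_image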
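
`tf.image.resize_with_pad` scales the image to fit inside the target while keeping its aspect ratio, then pads the remainder. The sketch below illustrates that geometry with integer arithmetic for a square target. It is not TensorFlow's implementation, and its rounding may differ by a pixel. The function name `letterbox_geometry` and the 240-pixel target are illustrative assumptions.

```python
def letterbox_geometry(height, width, target):
    """Return (new_height, new_width, pad_top, pad_left) for a square target."""
    # Scale the longer side to the target and the shorter side proportionally
    if height >= width:
        new_height, new_width = target, max(1, width * target // height)
    else:
        new_height, new_width = max(1, height * target // width), target
    # Split the leftover space evenly; any odd pixel goes to the bottom/right
    pad_top = (target - new_height) // 2
    pad_left = (target - new_width) // 2
    return new_height, new_width, pad_top, pad_left


print(letterbox_geometry(480, 640, 240))  # (180, 240, 30, 0)
```

A 640x480 landscape image is scaled to 240x180 and receives 30 rows of padding above and below, so the object keeps its proportions instead of being stretched.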

### Compilation with MERA software stack helper

In [4]:
def mera_compile(tflite_filename, image_path, platform, host_arch, output_dir):
    with mera.Deployer(output_dir, overwrite=True) as deployer:
        model = mera.ModelLoader(deployer).from_tflite(tflite_filename)
        # Read the shape of the first model input from its description
        input_size, _ = list(model.input_desc.values())[0]
        # Take the height (index 1 of the shape) as the square input size
        input_size = input_size[1]
        input_data = np.array(load_image(image_path, input_size))

        deployer.deploy(model, mera_platform=platform, target=Target.IP, host_arch=host_arch)
    return input_data
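
The helper above takes index 1 of the input shape and uses it for both height and width, which silently assumes a square, four-dimensional input. A defensive variant could be sketched as follows. It assumes that each `input_desc` value unpacks into a shape and one further field (inferred from the two-element unpacking in the helper, not from MERA documentation) and that the shape is laid out as batch, height, width, channels. The function name, the input name `images`, and the example shape are hypothetical.

```python
def square_input_size(input_desc):
    """Return the side length of the first input, assuming an NHWC shape."""
    shape, _other = next(iter(input_desc.values()))
    if len(shape) != 4:
        raise ValueError(f"expected a 4-D NHWC shape, got {shape}")
    _, height, width, _ = shape
    if height != width:
        raise ValueError(f"expected a square input, got {height}x{width}")
    return height


# Hypothetical stand-in for model.input_desc
example_desc = {"images": ((1, 240, 240, 3), "uint8")}
print(square_input_size(example_desc))  # 240
```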

### Models compilation

In [5]:
image_path = 'cat.png'
output_dir_lite1 = "deploy_effnet_lite1"
output_dir_lite4 = "deploy_effnet_lite4"
platform = Platform.DNAF300L0001  # Intel FPGA platform
host_arch = "x86"

input_data_eflite1 = mera_compile("effnet-lite1.tflite", image_path, platform, host_arch, output_dir_lite1)
input_data_eflite4 = mera_compile("effnet-lite4.tflite", image_path, platform, host_arch, output_dir_lite4)



### Load deployment directories

In [6]:
ip_lite1 = mera.load_mera_deployment(output_dir_lite1)
ip_lite4 = mera.load_mera_deployment(output_dir_lite4)

### Inference on hardware IP

In [None]:
def get_total_latency_ms(run_result, latency_key_name='elapsed_latency'):
    # Sum the per-entry latencies (in microseconds) and convert to milliseconds
    metrics = run_result.get_runtime_metrics()
    total_us = sum(x[latency_key_name] for x in metrics)
    return total_us / 1000

mera_runner_lite1 = ip_lite1.get_runner().set_input(input_data_eflite1).run()
mera_result_lite1 = mera_runner_lite1.get_outputs()
print("Optimized inference latency EfficientNet-Lite1 (IP):", get_total_latency_ms(mera_runner_lite1), "ms")

mera_runner_lite4 = ip_lite4.get_runner().set_input(input_data_eflite4).run()
mera_result_lite4 = mera_runner_lite4.get_outputs()
print("Optimized inference latency EfficientNet-Lite4 (IP):", get_total_latency_ms(mera_runner_lite4), "ms")
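
The latency helper can be checked without hardware by feeding it a stand-in object. The sketch below assumes, as the helper itself implies, that `get_runtime_metrics()` returns a list of dictionaries whose `elapsed_latency` values are in microseconds. `FakeRunResult` and the numbers are made up for illustration and are not part of MERA.

```python
class FakeRunResult:
    """Hypothetical stand-in exposing the one method the helper relies on."""

    def __init__(self, metrics):
        self._metrics = metrics

    def get_runtime_metrics(self):
        return self._metrics


def total_latency_ms(run_result, key="elapsed_latency"):
    # Sum the per-entry microsecond latencies, then convert to milliseconds
    return sum(m[key] for m in run_result.get_runtime_metrics()) / 1000


fake = FakeRunResult([{"elapsed_latency": 1500}, {"elapsed_latency": 2500}])
print(total_latency_ms(fake), "ms")  # 4.0 ms
```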

### Check the results from hardware

In [None]:
import ast

from tvm.contrib.download import download_testdata


def get_synset():
    synset_url = "".join(
        [
            "https://gist.githubusercontent.com/zhreshold/",
            "4d0b62f3d01426887599d4f7ede23ee5/raw/",
            "596b27d23537e5a1b5751d2b0481ef172f58b539/",
            "imagenet1000_clsid_to_human.txt",
        ]
    )
    synset_name = "imagenet1000_clsid_to_human.txt"
    synset_path = download_testdata(synset_url, synset_name, module="data")
    with open(synset_path) as f:
        # The file holds a Python dict literal; literal_eval parses it
        # without executing arbitrary downloaded code
        return ast.literal_eval(f.read())


synset = get_synset()

In [None]:
mera_top3_labels_lite1 = np.argsort(mera_result_lite1[0][0])[::-1][:3]
mera_top3_labels_lite4 = np.argsort(mera_result_lite4[0][0])[::-1][:3]
print("MERA compiled top-3 labels (EfficientNet-Lite1):", [synset[label] for label in mera_top3_labels_lite1])
print("MERA compiled top-3 labels (EfficientNet-Lite4):", [synset[label] for label in mera_top3_labels_lite4])
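
The top-3 selection above sorts the class scores in ascending order, reverses the result, and keeps the first three indices. The sketch below shows the same pattern on a small made-up score vector. The `uint8` dtype is an assumption about the quantized model output, and `top_k` is an illustrative name.

```python
import numpy as np


def top_k(scores, k=3):
    """Return the indices of the k largest scores, highest first."""
    return np.argsort(scores)[::-1][:k]


# Made-up scores standing in for one row of quantized model output
scores = np.array([3, 250, 17, 90, 42], dtype=np.uint8)
print(top_k(scores))  # [1 3 4]
```

Reversing the sorted indices is a safer choice here than sorting the negated scores: negating an unsigned integer array wraps around, which would produce a wrong ranking.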