# 🚀 **AIE4ML Tutorial: From QKeras → hls4ml → AMD AI Engine**

This tutorial shows how to:

* Build a small **quantized QKeras** model
* Convert to **hls4ml (bit-exact)**
* Convert to **AIE4ML (bit-exact)**
* Compare **x86 simulation output**
* Apply **simple AIE tuning overrides** (parallelism, tiling, placement)
* Inspect **AIE simulation reports**

---

# 1️⃣ Setup & Imports


In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam

from qkeras import QDense, QActivation, quantized_bits, quantized_relu

import hls4ml

np.random.seed(42)
tf.random.set_seed(42)

# 2️⃣ Build a Small QKeras Model

AIE4ML currently supports MLP-style architectures, such as Dense → ReLU → Dense → …

In [None]:
IN_FEATURES = 128
HIDDEN = 256
OUT_FEATURES = 64


def build_qkeras_model(in_features=IN_FEATURES, hidden=HIDDEN, out_features=OUT_FEATURES):
    model = Sequential([
        QActivation(quantized_bits(8, 2), name="input_quant", input_shape=(in_features,)),
        QDense(hidden,
               name="qfc1",
               kernel_quantizer=quantized_bits(8,0,alpha=1),
               bias_quantizer=quantized_bits(8,2,alpha=1)),
        QActivation(quantized_relu(8,0), name="qrelu1"),
        QDense(out_features,
               name="qfc2",
               kernel_quantizer=quantized_bits(8,0,alpha=1),
               bias_quantizer=quantized_bits(8,2,alpha=1)),
        QActivation(quantized_bits(8,2), name="output_quant"),
    ])
    return model

model = build_qkeras_model()

model.compile(optimizer=Adam(1e-3), loss="mse")

model.summary()
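
To get a feel for what these quantizers can represent, here is a small sketch that computes the range and step size of each one. It assumes the usual QKeras convention that `integer` does not count the sign bit. `quantizer_range` is a helper made up for this illustration and is not part of QKeras or hls4ml.

```python
def quantizer_range(bits, integer, signed=True):
    """Return (min, max, step) of a fixed-point quantizer.

    Assumption: `integer` counts integer bits without the sign bit.
    """
    frac_bits = bits - integer - (1 if signed else 0)
    step = 2.0 ** -frac_bits
    low = -(2.0 ** integer) if signed else 0.0
    high = 2.0 ** integer - step
    return low, high, step


print(quantizer_range(8, 2))                # activations and biases: quantized_bits(8, 2)
print(quantizer_range(8, 0))                # weights: quantized_bits(8, 0, alpha=1)
print(quantizer_range(8, 0, signed=False))  # quantized_relu(8, 0)
```

Under these assumptions, the 8-bit activations cover roughly [-4, 4) in steps of 1/32, while the weights cover [-1, 1) in steps of 1/128.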

# 3️⃣ Generate hls4ml config


In [None]:
# Create HLS config from model
cfg = hls4ml.utils.config_from_keras_model(model, granularity='name')

# Explicitly set output precision for last layer when activation is linear.
# Needed because hls4ml may omit linear activation nodes in the graph.
cfg['LayerName']['qfc2']['Precision'] = 'fixed<8,3,TRN,WRAP,0>'
cfg['LayerName']['qfc2_linear']['Precision'] = 'fixed<8,3,TRN,WRAP,0>'

print('Layer precision summary:')
for name, layer_cfg in cfg.get('LayerName', {}).items():
    print(f"  {name}: {layer_cfg.get('Precision', {})}")
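
The string `fixed<8,3,TRN,WRAP,0>` describes 8 bits in total, 3 of them integer bits (sign included), truncation as the rounding mode, and wrap-around on overflow. The sketch below mimics that behaviour in plain Python, assuming `ap_fixed`-style semantics. `to_fixed` is a hypothetical helper for illustration only, not an hls4ml function.

```python
import math


def to_fixed(value, width=8, integer=3):
    """Quantize `value` to a signed fixed-point grid with TRN rounding and WRAP overflow.

    Assumption: `integer` includes the sign bit, as in ap_fixed<width, integer>.
    """
    frac_bits = width - integer
    raw = math.floor(value * (1 << frac_bits))  # TRN: truncate toward minus infinity
    raw &= (1 << width) - 1                     # WRAP: keep only the low `width` bits
    if raw >= 1 << (width - 1):                 # reinterpret as two's complement
        raw -= 1 << width
    return raw / (1 << frac_bits)


print(to_fixed(1.3))    # snaps down to the nearest multiple of 1/32
print(to_fixed(-0.01))  # truncation moves negative values away from zero
print(to_fixed(4.0))    # just above the representable maximum, so it wraps
```

The last call is the one to watch: with `WRAP`, a value just above the maximum lands at the most negative value instead of saturating.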


# 4️⃣ Convert: Baseline hls4ml + AIE models

We create two compiled projects:

 🔹 `proj_hls/` – reference bit-exact model

 🔹 `proj_aie/` – AIE-backend model



In [None]:
hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=cfg,
    output_dir='proj_hls',
    project_name='proj_hls',
    bit_exact=True,
)

# You can specify the batch size and number of graph iterations for AIE backend
BATCH = 8
ITERS = 10
PLATFORM = 'xilinx_vek280_base_202520_1'

aie_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=cfg,
    output_dir='proj_aie',
    backend='aie',
    project_name='proj_aie',
    batch_size=BATCH,
    iterations=ITERS,
    part=PLATFORM,
)

hls_model.compile()
aie_model.compile()

print("Models compiled.")

# 5️⃣ Bit-Exact Check (HLS vs AIE x86)

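We compare only the first output batch.

The AIE simulator may emit more than `BATCH` samples when the graph runs multiple iterations, so the AIE output is truncated to its first `BATCH` rows before comparison.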
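
MSE and MAE tell you how far off the outputs are, but a bit-exact check really wants to know whether any element differs at all, and where. Here is a minimal sketch of that idea on toy data. `first_batch_mismatches` is a hypothetical helper, and the toy `simulated` list simply pretends the simulator returned two iterations' worth of rows.

```python
def first_batch_mismatches(reference, simulated, batch):
    """Compare `reference` with the first `batch` rows of `simulated`.

    Returns a list of (row, col, ref_value, sim_value) for every differing element.
    """
    trimmed = simulated[:batch]  # drop rows produced by later graph iterations
    mismatches = []
    for r, (ref_row, sim_row) in enumerate(zip(reference, trimmed)):
        for c, (ref_val, sim_val) in enumerate(zip(ref_row, sim_row)):
            if ref_val != sim_val:
                mismatches.append((r, c, ref_val, sim_val))
    return mismatches


# Toy data: 2 reference rows, 4 simulated rows (made-up values).
reference = [[0.5, -0.25], [1.0, 0.125]]
simulated = [[0.5, -0.25], [1.0, 0.15625], [0.0, 0.0], [0.0, 0.0]]

mismatches = first_batch_mismatches(reference, simulated, batch=2)
print(mismatches)
```

An empty list means the two outputs agree bit for bit on the first batch. The same loop also works on 2-D NumPy arrays such as the `predict` outputs.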


In [None]:
def compare_bit_exact(hls4ml_model, aie4ml_model, sim_mode='x86'):
    """Compare hls4ml and AIE outputs on one random batch and print error metrics."""
    x = np.random.random((BATCH, IN_FEATURES)).astype(np.float32)
    y_hls = hls4ml_model.predict(x)
    # Keep only the first batch: the AIE simulator may return more samples.
    y_aie = aie4ml_model.predict(x, simulator=sim_mode)[:BATCH]

    diff = np.abs(y_hls - y_aie)
    mse = np.mean(diff**2)
    mae = np.mean(diff)
    max_diff = np.max(diff)

    print("MSE       :", mse)
    print("MAE       :", mae)
    print("Max |diff|:", max_diff)
    print("Bit-exact :", bool(max_diff == 0))

# compare bit-exactness on the AIE x86 simulator output
compare_bit_exact(hls_model, aie_model)

# 6️⃣ Build the model

Call `aie_model.build()` to compile the model in `aie` mode and generate the AIE hardware design.

In [None]:
print("Building AIE project...")

aie_model.build()

print("AIE build & compile completed.")

# compare bit-exactness on the AIE HW simulator output
compare_bit_exact(hls_model, aie_model, sim_mode='aie')


# 7️⃣ View AIE Simulation Report

The report includes:

* Output interval and throughput (across all output ports)
* Usage of ports, memory, AIE cores, and other resources


In [None]:
from aie4ml.simulation import read_aie_report

report = read_aie_report(aie_model)
report

# 8️⃣ Apply Tuning Overrides (Parallelism, Tiling, Placement)

AIE4ML lets users **override hardware choices** per layer.

### Example knobs:
* Number of parallel cascade chains (`cas_num`)
* Length of each cascade (`cas_length`)
* Tiling sizes (`tile_m`, `tile_n`, `tile_k`)
* AIE tile placement (`row`, `col`)
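
To see what the parallelism knobs imply for a layer, here is a rough sketch for `qfc1` (128 inputs, 256 outputs). It follows this tutorial's description that `cas_num` splits the output features and `cas_length` splits the input features. The one-kernel-per-tile count and the even-divisibility check are assumptions of this sketch, and `split_dense_layer` is a made-up helper, not part of AIE4ML.

```python
def split_dense_layer(in_features, out_features, cas_num, cas_length):
    """Estimate how a Dense layer is divided across AIE tiles.

    Assumptions: cas_num chains split the output features (N), each chain's
    cas_length kernels split the input features (K), one kernel per tile.
    """
    if out_features % cas_num or in_features % cas_length:
        raise ValueError("this sketch assumes the features divide evenly")
    return {
        "tiles_used": cas_num * cas_length,
        "k_per_tile": in_features // cas_length,
        "n_per_tile": out_features // cas_num,
    }


plan = split_dense_layer(in_features=128, out_features=256, cas_num=2, cas_length=2)
print(plan)
```

With `cas_num=2` and `cas_length=2`, each of the four tiles would handle a 64 × 128 slice of the weight matrix under these assumptions.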


In [None]:
def tune_first_dense(cfg):
    layers = list(cfg['LayerName'].keys())
    dense_like = [name for name in layers if 'dense' in name.lower() or 'fc' in name.lower()]
    if not dense_like:
        raise ValueError("No dense-like layer found in cfg['LayerName']")
    target = dense_like[0]

    cfg['LayerName'][target].update({

        # ⚙️ Parallelism:
        #   - cas_num splits the *output features* (N dimension)
        #   - cas_length splits the *input features*  (K dimension)
        # Higher parallelism = more AIE tiles used = higher throughput. Try to keep both <= 8.
        'parallelism': {'cas_num': 2, 'cas_length': 2},

        # 🧩 Tiling: how the GEMM is partitioned inside the AIE. The default is usually optimal.
        # Controls tile sizes along M (batch), K (input features), N (output features).
        'tiling': {'tile_m': 4, 'tile_k': 8, 'tile_n': 8},

        # 📍 Placement: hard-pin the AIE layer graph to start from a specific tile (row, col).
        'placement': {'row': 0, 'col': 10},
    })

    print("Tuned:", target)
    return target

tuned_layer = tune_first_dense(cfg)
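
As a quick sanity check on the tiling values, you can count how many `tile_m × tile_k × tile_n` blocks are needed to cover one tile's share of the GEMM. This is only illustrative arithmetic: it assumes the 128 → 256 layer is split 2 × 2 by the parallelism settings above (so each tile sees K = 64 and N = 128) with a batch of 8, and `count_gemm_blocks` is a hypothetical helper.

```python
import math


def count_gemm_blocks(m, k, n, tile_m, tile_k, tile_n):
    """Count the (tile_m x tile_k) @ (tile_k x tile_n) blocks covering an M x K x N GEMM."""
    return math.ceil(m / tile_m) * math.ceil(k / tile_k) * math.ceil(n / tile_n)


# Per-tile problem under the assumptions stated above.
blocks = count_gemm_blocks(m=8, k=64, n=128, tile_m=4, tile_k=8, tile_n=8)
print(blocks)  # 2 * 8 * 16 blocks
```

Changing a tile size changes the block count, which is one reason to re-check the simulation report after every override.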


# 9️⃣ Convert Tuned AIE Model

We keep the same model but pass **the updated hls_config**.

➡️ *Tip:* You can now test the impact of the overrides on bit-exactness and performance.

In [None]:
aie_model_tuned = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=cfg,
    output_dir='proj_aie_tuned',
    backend='aie',
    project_name='proj_aie_tuned',
    batch_size=BATCH,
    iterations=ITERS,
    part=PLATFORM
)

aie_model_tuned.write()
aie_model_tuned.build()

compare_bit_exact(hls_model, aie_model_tuned, sim_mode='aie')

read_aie_report(aie_model_tuned)

# 🎉 Tutorial Complete!

