<img src="https://github.com/thesps/conifer/blob/master/conifer_v1.png?raw=true" width="250" alt="conifer" />

In this notebook we will learn how to load BDTs onto the `conifer` FPU using the model from `part_1`.

We'll target the Xilinx Alveo U50 card and a prebuilt FPU binary with 100 Tree Engines. First we download that binary from the conifer website.

<img src="https://www.xilinx.com/content/dam/xilinx/imgs/kits/U50_Hero_1_Bracket.png" width=250 alt="U50" />

In [None]:
!wget https://ssummers.web.cern.ch/conifer/downloads/v1.4/alveo/u50_gen3x16_xdma_5_202210_1/fpu_100TE_512N_DS.xclbin

In [None]:
from sklearn.datasets import make_moons
from sklearn.inspection import DecisionBoundaryDisplay
import matplotlib.pyplot as plt
import numpy as np
from scipy.special import expit
import conifer
import json
import os
import sys
import xgboost as xgb

# enable more output from conifer
import logging
logging.basicConfig(stream=sys.stdout, level=logging.WARNING)
logger = logging.getLogger('conifer')
logger.setLevel('DEBUG')

# create a random seed that we use to make the results repeatable
seed = int('fpga_tutorial'.encode('utf-8').hex(), 16) % 2**31

# Forest Processing Unit

Now we will execute the same model on the same Alveo card, but this time using the reconfigurable `conifer FPU` rather than the static binary we previously used.

We need to load our model in two phases: firstly load the FPU binary onto the FPGA, then load a model onto the FPU.

In [None]:
fpu = conifer.backends.fpu.runtime.AlveoDriver('fpu_100TE_512N_DS.xclbin')

### FPU Config
The configuration used to build the FPU binary is stored as a string as part of the binary itself. When we created the driver above, the configuration was read from the device, and below we print it. The key constraints that will restrict the size of model we can deploy are the number of Tree Engines, the number of nodes per TE, and the number of features.
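
As a rough illustration of those three constraints, the sketch below checks a model's size against an FPU configuration. It is a minimal sketch under stated assumptions: the helper `fits_fpu` is hypothetical (not part of conifer), the keys `'nodes'` and `'features'` and the value 16 are assumed for illustration, and it assumes one tree per Tree Engine. Check the keys actually present in `fpu.config` before relying on it.

```python
def fits_fpu(n_trees, max_nodes_per_tree, n_features, fpu_cfg):
    """Hypothetical helper: compare a model's size with FPU limits.

    Assumes one tree per Tree Engine and assumed config keys
    'tree_engines', 'nodes' and 'features'.
    """
    checks = {
        'tree_engines': n_trees <= fpu_cfg['tree_engines'],
        'nodes': max_nodes_per_tree <= fpu_cfg['nodes'],
        'features': n_features <= fpu_cfg['features'],
    }
    return all(checks.values()), checks


# stand-in configuration; the 'features' value is made up for illustration
example_cfg = {'tree_engines': 100, 'nodes': 512, 'features': 16}

ok, checks = fits_fpu(20, 15, 2, example_cfg)        # small model
too_big, why = fits_fpu(150, 15, 2, example_cfg)     # too many trees
print(ok, too_big, why)
```

The per-constraint dictionary makes it easy to see which limit a model exceeds, rather than only that it does not fit.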

**Note**: we can only load one binary onto the FPGA at any time, so we cannot keep the `model_u50` above loaded at the same time as the `fpu` below. Save any result data (e.g. test data and predictions) to files in order to make comparisons!
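
One simple way to keep results for later comparison is to write them to `.npy` files. The sketch below uses a hypothetical helper `save_results` and a temporary directory with made-up values purely for demonstration; in the notebook you would pass your own output directory and prediction arrays.

```python
import os
import tempfile

import numpy as np


def save_results(out_dir, **arrays):
    """Hypothetical helper: save each named array to <out_dir>/<name>.npy."""
    os.makedirs(out_dir, exist_ok=True)
    paths = {}
    for name, arr in arrays.items():
        path = os.path.join(out_dir, f'{name}.npy')
        np.save(path, np.asarray(arr))
        paths[name] = path
    return paths


# demonstration with made-up values in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    y_demo = np.array([0.1, -0.4, 2.3], dtype='float32')
    paths = save_results(tmp, y_static=y_demo)
    reloaded = np.load(paths['y_static'])

print(reloaded)
```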

In [None]:
fpu.config

## Load model part 1
Now that we have loaded the FPU onto the FPGA, we can load a model onto the FPU's node memories. We need to provide the FPU configuration when we convert the model in order to 'compile' the model into FPU DecisionNode data matching the target architecture.
Specifically, we need to set the `'FPU'` section of the `fpu` backend configuration. We can load this from the JSON file saved with the FPU build if we have it, or, as in this case, use the configuration that we read from the device itself.

If we are using an FPU with the 'dynamic scaler', scale factors for the features will be derived at this step. The automatic derivation is currently a bit too aggressive, so as a workaround we un-apply the auto-derived scales and then apply some more reasonable ones.
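
To see why the choice of scale factor matters, here is a minimal NumPy sketch. It assumes, for illustration only, that thresholds and features are multiplied by a scale factor and rounded to integers before being compared; the helper `scale_and_round` and all values are made up, and the actual FPU number representation may differ.

```python
import numpy as np


def scale_and_round(values, scale):
    """Illustrative only: multiply by a scale factor and round to integers."""
    return np.round(np.asarray(values, dtype=np.float64) * scale).astype(np.int32)


# made-up thresholds and feature values
thresholds = np.array([-0.731, 0.052, 1.249])
x = np.array([-0.9, 0.051, 1.3])

float_decisions = x < thresholds

# a scale of 1000 keeps three decimal places, so the decisions are unchanged
t_int = scale_and_round(thresholds, 1000)
x_int = scale_and_round(x, 1000)
int_decisions = x_int < t_int

# a scale of 10 is too coarse: 0.051 and 0.052 both round to 1
coarse_decisions = scale_and_round(x, 10) < scale_and_round(thresholds, 10)

print(float_decisions, int_decisions, coarse_decisions)
```

With too small a scale, nearby values collapse onto the same integer and some node decisions flip; too large a scale risks overflowing the available bits.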

**Note**: there is no communication with the FPGA at this step, this is all Python running on the host PC.

**Note**: the 'compilation' done by conifer is quite simple, just reordering and packing the model data into bits. It is quite fast.
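
To give a feel for what 'packing into bits' means, the sketch below packs one decision node into a single integer word and unpacks it again. The field names and bit widths are hypothetical and chosen only for illustration; they are not the actual conifer FPU DecisionNode layout.

```python
# hypothetical field widths, for illustration only
FEATURE_BITS, THRESHOLD_BITS, CHILD_BITS = 8, 16, 10


def _mask(bits):
    return (1 << bits) - 1


def pack_node(feature, threshold, left, right):
    """Pack a node into one integer: feature | threshold | left | right."""
    word = feature & _mask(FEATURE_BITS)
    # store the signed threshold in two's complement
    word = (word << THRESHOLD_BITS) | (threshold & _mask(THRESHOLD_BITS))
    word = (word << CHILD_BITS) | (left & _mask(CHILD_BITS))
    word = (word << CHILD_BITS) | (right & _mask(CHILD_BITS))
    return word


def unpack_node(word):
    """Inverse of pack_node."""
    right = word & _mask(CHILD_BITS)
    word >>= CHILD_BITS
    left = word & _mask(CHILD_BITS)
    word >>= CHILD_BITS
    threshold = word & _mask(THRESHOLD_BITS)
    word >>= THRESHOLD_BITS
    if threshold >= 1 << (THRESHOLD_BITS - 1):
        threshold -= 1 << THRESHOLD_BITS  # restore the sign
    feature = word & _mask(FEATURE_BITS)
    return feature, threshold, left, right


packed = pack_node(feature=1, threshold=-731, left=3, right=4)
unpacked = unpack_node(packed)
print(hex(packed), unpacked)
```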

In [None]:
cfg = conifer.backends.fpu.auto_config()
cfg['FPU'] = fpu.config
model_fpu = conifer.model.load_model('prj_conifer_part_1/my_prj.json', new_config=cfg)
model_fpu.scale(1./model_fpu.threshold_scale, model_fpu.score_scale) # unscale
model_fpu.scale(1000, 1000)                                          # rescale

## Load model part 2

Now we download the model onto the FPU (that is already loaded onto the FPGA). As with the static accelerator, we specify the batch size in order to allocate buffers.
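
If a dataset does not divide evenly into the chosen batch size, one option is to split it into fixed-size batches and pad the last one. The sketch below is a host-side illustration only: the helper `iter_padded_batches` is hypothetical, and whether the driver needs each call to match the allocated batch size exactly is an assumption here.

```python
import numpy as np


def iter_padded_batches(X, batch_size):
    """Yield (batch, n_valid) pairs; the last batch is zero-padded."""
    n = X.shape[0]
    for start in range(0, n, batch_size):
        chunk = X[start:start + batch_size]
        n_valid = chunk.shape[0]
        if n_valid < batch_size:
            pad = np.zeros((batch_size - n_valid, X.shape[1]), dtype=X.dtype)
            chunk = np.vstack([chunk, pad])
        yield chunk, n_valid


# made-up data: 7 samples with 2 features, split into batches of 3
X_demo = np.arange(14, dtype='float32').reshape(7, 2)
batches = list(iter_padded_batches(X_demo, batch_size=3))
print([(b.shape, n_valid) for b, n_valid in batches])
```

Keeping `n_valid` alongside each batch lets you discard the outputs that correspond to padding rows.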

In [None]:
model_fpu.attach_device(fpu, batch_size=2500)

In [None]:
X_test = np.load('moons_dataset/X_test.npy').astype('float32')
y_test = np.load('moons_dataset/y_test.npy')
model_py = conifer.model.load_model('prj_conifer_part_2/my_prj.json', new_config={'backend':'py','output_dir':'dummy','project_name':'dummy'})
y_py = model_py.decision_function(X_test)

### Do inference

In [None]:
y_fpu = model_fpu.decision_function(X_test)

In [None]:
xgb_model = xgb.XGBClassifier()
xgb_model.load_model('prj_conifer_part_1/xgboost_model.json')

In [None]:
y_xgb = xgb_model.predict_proba(X_test)

## Compare

Now we'll plot the decision boundary again, this time comparing the FPU output to the XGBoost output.

In [None]:
# make a 1000x1000 grid of points in the feature space
X_mesh = np.meshgrid(np.linspace(-3, 3, 1000), np.linspace(-3, 3, 1000))
# reshape them for inference
X_grid = np.vstack([X_mesh[0].ravel(), X_mesh[1].ravel()]).T.astype('float32')
model_fpu.attach_device(fpu, batch_size=X_grid.shape[0]) # reinitialize the FPU with the batch size of the full dataset
# run inference on the FPU, compute the class probability, reshape to 1000x1000 grid
y_hls_mesh = np.reshape(expit(model_fpu.decision_function(X_grid)), X_mesh[0].shape)
# run the xgboost prediction on the same grid
y_xgb_mesh = np.reshape(xgb_model.predict_proba(X_grid)[:,1], X_mesh[0].shape)

In [None]:
# display the boundaries, and the difference
f, axs = plt.subplots(1, 3, figsize=(15,5))

# plot the FPU
display = DecisionBoundaryDisplay(xx0=X_mesh[0], xx1=X_mesh[1], response=y_hls_mesh)
display.plot(cmap='PiYG', ax=axs[0])
axs[0].scatter(X_test[:,0][:200], X_test[:,1][:200], c=y_test[:200], cmap='PiYG', edgecolors='k')
axs[0].set_title('FPU')

# plot the XGBoost
display = DecisionBoundaryDisplay(xx0=X_mesh[0], xx1=X_mesh[1], response=y_xgb_mesh)
display.plot(cmap='PiYG', ax=axs[1])
axs[1].scatter(X_test[:,0][:200], X_test[:,1][:200], c=y_test[:200], cmap='PiYG', edgecolors='k')
axs[1].set_title('XGBoost')

# plot the difference
pcm = axs[2].pcolormesh(X_mesh[0], X_mesh[1], y_xgb_mesh-y_hls_mesh)
axs[2].set_title('XGBoost - FPU')
f.colorbar(pcm)
plt.tight_layout()
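
A numerical summary can complement the difference plot. The sketch below computes the largest absolute difference between two sets of class probabilities and the fraction of samples on which they agree at a 0.5 threshold. The helper `compare_scores` is hypothetical and the score arrays are made-up stand-ins, not results from the FPU.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, equivalent to scipy.special.expit."""
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=np.float64)))


def compare_scores(p_ref, p_test, threshold=0.5):
    """Return (max absolute difference, fraction of agreeing class decisions)."""
    p_ref = np.asarray(p_ref, dtype=np.float64).ravel()
    p_test = np.asarray(p_test, dtype=np.float64).ravel()
    max_abs_diff = float(np.max(np.abs(p_ref - p_test)))
    agreement = float(np.mean((p_ref > threshold) == (p_test > threshold)))
    return max_abs_diff, agreement


# made-up raw scores standing in for a reference and a test implementation
scores_ref = np.array([-2.0, -0.1, 0.3, 1.5])
scores_test = np.array([-1.9, 0.05, 0.3, 1.4])
max_diff, agreement = compare_scores(sigmoid(scores_ref), sigmoid(scores_test))
print(max_diff, agreement)
```

In this notebook the equivalent call would be along the lines of `compare_scores(y_xgb[:, 1], expit(y_fpu))`, since `decision_function` returns raw scores while `predict_proba` returns probabilities.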

## Building FPU

The Forest Processing Unit is implemented in HLS, and `conifer` also provides the interface to build a new architecture from a configuration. Now we will do that, running only the HLS C Synthesis step so that we can take a look at the reports.

The Alveo U50 is not in conifer's default list of supported boards, so we first register it with conifer using the proper part number.

Try changing some of the configuration (like the number of TEs, nodes, features etc). The configuration is printed at the end of the next cell, so you can see what options can be changed and repeat.
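
When trying several configurations, it helps to give each one its own output directory so that builds do not overwrite each other. The sketch below generates such variants from a base dictionary. The helper `config_variants` is hypothetical, and the base dictionary is a small stand-in for the full configuration returned by `FPUBuilder.default_cfg()`.

```python
import copy


def config_variants(base_cfg, tree_engine_options):
    """Return one config per Tree Engine count, each with its own output dir."""
    variants = []
    for n_te in tree_engine_options:
        cfg = copy.deepcopy(base_cfg)  # leave the base config untouched
        cfg['tree_engines'] = n_te
        cfg['project_name'] = f"{base_cfg['project_name']}_{n_te}te"
        cfg['output_dir'] = f"{base_cfg['output_dir']}_{n_te}te"
        variants.append(cfg)
    return variants


# stand-in for the full FPU builder configuration
base = {'output_dir': 'my_conifer_fpu', 'project_name': 'custom_fpu', 'tree_engines': 42}
variants = config_variants(base, [10, 42, 100])
print([v['output_dir'] for v in variants])
```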

In [None]:
u50 = conifer.backends.boards.AlveoConfig.default_config()
u50['xilinx_part'] = 'xcu50-fsvh2104-2-e'
u50['platform'] = 'xilinx_u50_gen3x16_xdma_5_202210_1'
u50['name'] = 'xilinx_u50_gen3x16_xdma_5_202210_1'
u50 = conifer.backends.boards.AlveoConfig(u50)
conifer.backends.boards.register_board_config(u50.name, u50)

new_fpu_cfg = conifer.backends.fpu.FPUBuilder.default_cfg()
new_fpu_cfg['output_dir'] = 'my_conifer_fpu'
new_fpu_cfg['project_name'] = 'custom_fpu'
new_fpu_cfg['tree_engines'] = 42
new_fpu_cfg['board'] = u50.name
new_fpu_cfg['clock_period'] = 2.5
new_fpu_cfg

In [None]:
# yavin setup
import os
os.environ['PATH'] = '/opt/Xilinx/Vitis_HLS/2023.1/bin/:' + os.environ['PATH']
os.environ['XILINX_HLS'] = '/opt/Xilinx/Vitis_HLS/2023.1/'

In [None]:
# INFN CNAF Setup
import os
os.environ['PATH'] = '/tools/Xilinx/Vitis_HLS/2023.2/bin/:' + os.environ['PATH']
os.environ['XILINX_HLS'] = '/tools/Xilinx/Vitis_HLS/2023.2/'

### Run build
Now we write the HLS project and the necessary build scripts, then run the HLS C Synthesis. This will take a few minutes; once it finishes, check the reports!

In [None]:
fpu_builder = conifer.backends.fpu.FPUBuilder(new_fpu_cfg)
fpu_builder.write()
fpu_builder.build(synth=True, bitfile=False)
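
To locate the reports after the build, one option is to search the output directory recursively. The sketch below assumes the reports are text files ending in `.rpt` somewhere under the output directory; the exact location and file names depend on the tool version, so treat the pattern as an assumption. The helper `find_reports` is hypothetical, and the demonstration uses a temporary directory with stand-in files.

```python
import tempfile
from pathlib import Path


def find_reports(output_dir, pattern='*.rpt'):
    """Recursively list files matching `pattern` under `output_dir`."""
    return sorted(Path(output_dir).rglob(pattern))


# demonstration on a temporary directory with stand-in files
with tempfile.TemporaryDirectory() as tmp:
    report_dir = Path(tmp) / 'my_conifer_fpu' / 'report'
    report_dir.mkdir(parents=True)
    (report_dir / 'csynth.rpt').write_text('stand-in report\n')
    (report_dir / 'notes.txt').write_text('not a report\n')
    found = [p.name for p in find_reports(Path(tmp) / 'my_conifer_fpu')]

print(found)
```

In the notebook, `find_reports('my_conifer_fpu')` would search the output directory set in the configuration above.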

# Exercise

Try out the FPU flexibility by training some more BDTs and carrying out inference on the FPGA using the conifer FPU we downloaded at the start.