# From HWGQ-Caffe to HLS: A simple FINN example

**WARNING: FINN is undergoing a major overhaul at the moment, so the information below (especially the implementation details) is subject to change.**

We'll use a pre-trained, binarized network that comes with FINN. As we'll see in a moment, it takes 784 inputs (the number of pixels in one 28x28 image from the MNIST dataset), passes them through three layers of 256 neurons each, and produces 10 outputs (one for each digit).
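To get a feel for the size of this network, the short sketch below counts the weights implied by the layer widths above. It assumes that every layer is fully connected and has no bias terms; the variable names are made up for illustration and are not part of FINN.

```python
# Layer widths as described in the text: 784 inputs, three layers of
# 256 neurons, and 10 outputs.
layer_sizes = [784, 256, 256, 256, 10]

# Assumption: fully connected layers without bias, so each layer has
# one weight per (input, output) pair.
weights_per_layer = [
    n_in * n_out for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
total_weights = sum(weights_per_layer)

print(weights_per_layer)
print("Total weights: %d" % total_weights)
# With 1-bit weights, the packed parameters would need this many bytes.
print("Bytes at 1 bit per weight: %d" % (total_weights // 8))
```

Under these assumptions the network has 334336 weights, which would fit in about 41 KB if each weight were stored as a single bit.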

In [13]:
import os
FINN_ROOT = os.environ['FINN_ROOT']
params = FINN_ROOT + "/inputs/sfc-w1a1.caffemodel"
topology = FINN_ROOT + "/inputs/sfc-w1a1.prototxt"
! ls -l $params
! ls -l $topology

-rw-r--r-- 1 root root 1344691 Jun 11 14:22 /app/FINN/../FINN/FINN/inputs/sfc-w1a1.caffemodel
-rw-r--r-- 1 root root 2964 Feb 20 14:52 /app/FINN/../FINN/FINN/inputs/sfc-w1a1.prototxt


In [22]:
! head -n 58 $topology

name: "sfc-w1a1-hwgq"
input: "data"
input_shape {
  dim: 64
  dim: 1
  dim: 28
  dim: 28
}
layer {
  name: "bn_inp"
  type: "BatchNorm"
  bottom: "data"
  top: "bn_inp"
  param {
    lr_mult: 0
  }
  param {
    lr_mult: 0
  }
  param {
    lr_mult: 0
  }
  batch_norm_param {
    moving_average_fraction: 0.95
  }
}
layer {
  name: "qt_inp"
  type: "Quant"
  bottom: "bn_inp"
  top: "qt_inp"
  quant_param {
    forward_func: "sign"
    backward_func: "hard_tanh"
    clip_thr: 1.0
  }
}
layer {
  name: "ip1"
  type: "BinaryInnerProduct"
  bottom: "qt_inp"
  top: "ip1"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  inner_product_param {
    num_output: 256
    bias_term: false
    weight_filler {
      type: "gaussian"
      std: 0.005
    }
  }
  binary_inner_product_param {
    use_alpha: true
  }  
}


Here you can see how the input data is normalized and quantized, followed by the first binarized fully-connected (here called "inner product") layer. Tools such as [Netscope](https://ethereon.github.io/netscope/#/editor) can help visualize the structure of the network.

## Frontend: From Caffe to FINN IR
The frontend stage is responsible for converting QNNs trained by a variety of frameworks to the FINN intermediate representation (IR). As each framework exposes its QNN topologies through a custom format, FINN must first convert them to a common IR that it knows how to process.

In [3]:
# import a trained network through the HWGQ-Caffe frontend
import FINN.frontend.frontend_hwgq as fe
imported_net = fe.importCaffeNetwork(topology, params)
imported_net

[LinearLayer,
 BipolarThresholdingLayer,
 FullyConnectedLayer,
 LinearLayer,
 LinearLayer,
 BipolarThresholdingLayer,
 FullyConnectedLayer,
 LinearLayer,
 LinearLayer,
 BipolarThresholdingLayer,
 FullyConnectedLayer,
 LinearLayer,
 LinearLayer,
 BipolarThresholdingLayer,
 FullyConnectedLayer,
 LinearLayer,
 LinearLayer]

## What functionality does the IR expose?
Let's get acquainted with what the FINN IR looks like and try to look "inside" some of the layers. As FINN is open source, you could simply look at the IR source code in `FINN/core/layers.py`, but here we'll use a little helper function to investigate the exposed members and functions dynamically.


In [14]:
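def showMembers(layer):
    # list the public (non-underscore) attributes and methods of a layer;
    # a list comprehension returns a list on both Python 2 and Python 3
    return [name for name in dir(layer) if not name.startswith("_")]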

In [31]:
showMembers(imported_net[2])

['W',
 'execute',
 'getInputSize',
 'getNumOps',
 'getOutputSize',
 'getParamSize',
 'getTotalInputBits',
 'getTotalOutputBits',
 'getTotalParamBits',
 'get_filter_dim',
 'get_in_dim',
 'get_out_dim',
 'get_pad',
 'get_parallel',
 'get_stride',
 'get_type',
 'ibits',
 'in_dim',
 'insize',
 'kernel',
 'obits',
 'outsize',
 'updateBitwidths',
 'wbits']

In [38]:
print("Inputs: %d, outputs: %d" % (imported_net[2].getInputSize(), imported_net[2].getOutputSize()))

Inputs: 784, outputs: 256


In fact, large parts of the FINN IR are *executable*, as you probably guessed from the `execute` function. At any point, we can generate a random vector of appropriate dimensions and pass it through this layer by calling `execute` to see what the output looks like.

In [46]:
import numpy as np
rand_inp_vec = np.random.randn(784)
ret = imported_net[2].execute(rand_inp_vec)
ret.shape

(256,)

As expected, passing in an input vector of the appropriate size for this layer yields an output vector with one element per neuron.
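As a rough mental model of what happens here, the sketch below mimics a binarized fully connected layer with plain NumPy. It assumes the layer computes a matrix-vector product with a 256 x 784 weight matrix whose entries are -1 or +1; the random weights and the variable names are stand-ins, not FINN's actual implementation.

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical stand-in for the layer's weights: 256 x 784, entries in {-1, +1}.
W = rng.choice([-1.0, 1.0], size=(256, 784))

# Bipolar input vector, as a preceding thresholding layer would produce.
x = rng.choice([-1.0, 1.0], size=784)

# Assumed behaviour of the layer: one dot product per output neuron.
y = np.dot(W, x)

print(y.shape)
```

Because every weight and every input is -1 or +1, each output is an even integer between -784 and 784, which is what makes such layers cheap to implement in hardware.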

## The Streamlining Transform

You may have noticed that our imported network contains a bunch of `LinearLayer` instances with floating point parameters (currently indicated in the FINN IR as 32 bits), like this one:



In [48]:
print("Layer type: %s weight bits: %d" % (imported_net[3].get_type(), imported_net[3].wbits))

Layer type: LinearLayer weight bits: 32


So where do these come from in a binarized network? Many state-of-the-art BNN/QNN methods use some floating point computation in the forward pass to improve accuracy, for example batch normalization layers and channel-wise scaling factors. Although these layers typically involve little computation, they can still slow down devices where floating point operations are expensive, and their floating point parameters increase the memory footprint of the QNN. Both costs matter when creating "dataflow-style" FPGA accelerators, so we'd like to get rid of those floating point parameters and operations.
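To see why a batch normalization layer can show up as a `LinearLayer`, note that at inference time it applies a fixed scale and shift to each input. The sketch below folds the usual batch normalization formula into a pair `(A, B)` with `A * x + B` as the result. The helper `batchnorm_as_linear` and the statistics are made up for illustration, and how the FINN frontend performs this conversion is an assumption here.

```python
import numpy as np

def batchnorm_as_linear(mean, var, gamma=1.0, beta=0.0, eps=1e-5):
    """Return (A, B) such that A * x + B equals inference-time batch norm."""
    A = gamma / np.sqrt(var + eps)
    B = beta - A * mean
    return A, B

# Made-up statistics, not taken from the sfc-w1a1 model.
mean, var = 33.0, 6000.0
A, B = batchnorm_as_linear(mean, var)

# Compare against the textbook formula on all 8-bit pixel values.
x = np.arange(0.0, 256.0)
reference = (x - mean) / np.sqrt(var + 1e-5)
print(np.allclose(A * x + B, reference))
```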

Fortunately, we can use what is referred to as [streamlining](https://arxiv.org/pdf/1709.04060.pdf) to do this, without losing any accuracy! Streamlining is implemented as a *transformation* in FINN: a function that takes in the FINN IR of a network and returns a transformed FINN IR.

In [50]:
# example of a device-neutral transform: streamlining
import FINN.transforms.transformations as tf
streamlined_net = tf.makeCromulent(imported_net)
print(streamlined_net)
print("Number of layers in original imported network: %d" % len(imported_net))
print("Number of layers in streamlined network: %d" % len(streamlined_net))

[BipolarThresholdingLayer, FullyConnectedLayer, BipolarThresholdingLayer, FullyConnectedLayer, BipolarThresholdingLayer, FullyConnectedLayer, BipolarThresholdingLayer, FullyConnectedLayer, LinearLayer]
Number of layers in original imported network: 17
Number of layers in streamlined network: 9


You can see that all `LinearLayer`s besides the final one have disappeared. This is achieved by updating the thresholds of the network. You can read more about how this is done [here](https://arxiv.org/pdf/1709.04060.pdf), or by looking at the source code for this transformation.
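One ingredient of this process is that two adjacent linear layers can always be collapsed into one. The sketch below shows the arithmetic for element-wise scale-and-shift layers; the helper `compose_linear` is hypothetical, and whether `makeCromulent` performs exactly this step is an assumption.

```python
import numpy as np

def compose_linear(A1, B1, A2, B2):
    """Collapse x -> A2 * (A1 * x + B1) + B2 into a single (A, B) pair."""
    return A2 * A1, A2 * B1 + B2

rng = np.random.RandomState(1)
# Random per-channel coefficients for two consecutive linear layers.
A1, B1, A2, B2 = (rng.randn(256) for _ in range(4))
A, B = compose_linear(A1, B1, A2, B2)

x = rng.randn(256)
two_step = A2 * (A1 * x + B1) + B2
one_step = A * x + B
print(np.allclose(two_step, one_step))
```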

In [57]:
print("Original input normalization coefficients: %s %s " % (str(imported_net[0].A), str(imported_net[0].B)))
print("Original input quantization threshold: " + str(imported_net[1].thresholds))
print("Streamlined input quantization threshold: " + str(streamlined_net[0].thresholds))

Original input normalization coefficients: [0.01270615] [-0.42501277] 
Original input quantization threshold: [[0.]]
Streamlined input quantization threshold: [[34]]
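The new threshold can be reproduced by hand. The sketch below assumes that the `LinearLayer` computes `A * x + B`, that the thresholding layer outputs +1 when its input is greater than or equal to the threshold and -1 otherwise, and that the inputs are integer pixel values from 0 to 255. It uses the rounded coefficients printed above rather than the exact stored values.

```python
import math
import numpy as np

# Normalization coefficients as printed above (rounded).
A = 0.01270615
B = -0.42501277

# Assumption: inputs are 8-bit integer pixel values.
x = np.arange(256)

# Original network: linear layer followed by thresholding at 0.
original = np.where(A * x + B >= 0.0, 1, -1)

# Since A > 0, A * x + B >= 0 is the same as x >= -B / A.
# For integer x, this is the same as x >= ceil(-B / A).
threshold = int(math.ceil(-B / A))
streamlined = np.where(x >= threshold, 1, -1)

print(threshold)
print(np.array_equal(original, streamlined))
```

Here `-B / A` is roughly 33.45, so rounding up gives 34, and both versions produce the same output for every possible pixel value without any floating point arithmetic at run time.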


In [15]:
showMembers(streamlined_net[0])

['execute',
 'getInputSize',
 'getOutputSize',
 'get_filter_dim',
 'get_in_dim',
 'get_out_dim',
 'get_pad',
 'get_parallel',
 'get_stride',
 'get_type',
 'ibits',
 'obits',
 'thresholds',
 'updateBitwidths']

In [6]:
# inspect weights of first FC layer
print(streamlined_net[1].W)

[[-1. -1. -1. ...  1.  1.  1.]
 [-1.  1.  1. ...  1.  1.  1.]
 [ 1.  1. -1. ... -1.  1.  1.]
 ...
 [ 1.  1. -1. ...  1.  1.  1.]
 [ 1. -1.  1. ... -1.  1. -1.]
 [-1. -1. -1. ...  1. -1. -1.]]
