## BitFusion Tutorial

***Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Networks*** **[\[arXiv\]](https://arxiv.org/abs/1712.01507)** **[\[Github\]](https://github.com/hsharma35/bitfusion)**

### Motivation for BitFusion (Why do you need it?)
As its name suggests, BitFusion provides dedicated hardware support for dynamic bit-level composition in deep neural networks (DNNs). Specifically: 1) it is a bit-flexible accelerator built from an array of bit-level processing elements that dynamically fuse to match the bitwidth of individual DNN layers; 2) the paper reports that it outperforms two state-of-the-art accelerators, Eyeriss and Stripes. This tutorial walks through the concrete usage of the BitFusion simulator.
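
To make the idea of fusing concrete, the sketch below multiplies two unsigned integers using only 2-bit by 2-bit products combined with shifts and adds. It is an illustrative software model, not the simulator's code: the function names are hypothetical, the 2-bit brick width is an assumption taken from the minimum precision reported later in this tutorial, and signed operands are ignored for simplicity.

```python
def split_into_bricks(value, bits, brick_bits=2):
    """Split an unsigned integer into little-endian chunks of `brick_bits` bits."""
    mask = (1 << brick_bits) - 1
    return [(value >> shift) & mask for shift in range(0, bits, brick_bits)]


def fused_multiply(a, b, a_bits, b_bits, brick_bits=2):
    """Multiply a and b using only brick-sized partial products plus shift-add."""
    total = 0
    for i, a_chunk in enumerate(split_into_bricks(a, a_bits, brick_bits)):
        for j, b_chunk in enumerate(split_into_bricks(b, b_bits, brick_bits)):
            # Each small product is shifted to the position of its two chunks
            total += (a_chunk * b_chunk) << ((i + j) * brick_bits)
    return total


def bricks_needed(a_bits, b_bits, brick_bits=2):
    """Number of brick-sized multipliers one a_bits x b_bits product uses."""
    per_a = -(-a_bits // brick_bits)  # ceiling division
    per_b = -(-b_bits // brick_bits)
    return per_a * per_b


print(fused_multiply(200, 123, 8, 8) == 200 * 123)
print(bricks_needed(8, 8), bricks_needed(2, 2))
```

In this model an 8-bit by 8-bit product consumes 16 bricks while a 2-bit by 2-bit product consumes one, which is why lower-precision layers can, in principle, perform more multiplications with the same set of bricks.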

#### Original Codebase

In [1]:
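# download the source code of the simulator
!git clone https://github.com/hsharma35/bitfusion.git
# `!cd` runs in a throwaway subshell, so use the %cd magic to change the notebook's working directory
%cd bitfusion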

Cloning into 'bitfusion'...


The repository appears to include a usage guide, but it is only a showcase (i.e., plotting code), and the critical data file it depends on is missing.<br>
*check the issue here: https://github.com/hsharma35/bitfusion/issues/2*

#### Prerequisites

In [3]:
!pip install -r requirements.txt
!pip install configparser

DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date. A future version of pip will drop support for Python 2.7. More details about Python 2 support in pip, can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support


In [11]:
!git clone https://github.com/hsharma35/dnnweaver2.git
!mv ./dnnweaver2 ./dnnweaver2-repo
!cp -r ./dnnweaver2-repo/dnnweaver2 dnnweaver2

Cloning into 'dnnweaver2'...
remote: Enumerating objects: 360, done.
remote: Total 360 (delta 0), reused 0 (delta 0), pack-reused 360
Receiving objects: 100% (360/360), 20.37 MiB | 15.63 MiB/s, done.
Resolving deltas: 100% (116/116), done.
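
Before moving on, it can help to confirm that the working directory has the layout the later cells expect. The snippet below is a small sketch of such a check; `find_missing` and the `EXPECTED` list are hypothetical helpers written for this tutorial, and the paths are simply the ones used by the commands in this notebook.

```python
import os


def find_missing(paths, root="."):
    """Return the entries of `paths` that do not exist under `root`."""
    return [p for p in paths if not os.path.exists(os.path.join(root, p))]


# Paths this tutorial relies on, relative to the bitfusion checkout
EXPECTED = ["requirements.txt", "sram", "dnnweaver2", "dnnweaver2-repo"]

if __name__ == "__main__":
    missing = find_missing(EXPECTED)
    print("missing: {}".format(missing if missing else "nothing"))
```

If `dnnweaver2` shows up as missing, the `cp -r` step above most likely ran from a different directory.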


#### Python wrapper to generate SRAM sweeps with different capacity/#banks/widths

In [14]:
# Prerequisites: Clone and compile cacti-7
!cd sram && git clone https://github.com/HewlettPackard/cacti.git

Cloning into 'cacti'...
remote: Enumerating objects: 135, done.
remote: Total 135 (delta 0), reused 0 (delta 0), pack-reused 135
Receiving objects: 100% (135/135), 285.56 KiB | 0 bytes/s, done.
Resolving deltas: 100% (74/74), done.


In [19]:
# or you can use the released version
! cd sram && wget https://github.com/HewlettPackard/cacti/archive/v6.5.0.tar.gz

--2020-06-19 20:10:46--  https://github.com/HewlettPackard/cacti/archive/v6.5.0.tar.gz
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/HewlettPackard/cacti/tar.gz/v6.5.0 [following]
--2020-06-19 20:10:46--  https://codeload.github.com/HewlettPackard/cacti/tar.gz/v6.5.0
Resolving codeload.github.com (codeload.github.com)... 140.82.112.9
Connecting to codeload.github.com (codeload.github.com)|140.82.112.9|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/x-gzip]
Saving to: ‘v6.5.0.tar.gz’

    [ <=>                                   ] 91,178      --.-K/s   in 0.06s   

2020-06-19 20:10:47 (1.36 MB/s) - ‘v6.5.0.tar.gz’ saved [91178]



In [21]:
!cd sram && tar -xvzf v6.5.0.tar.gz

cacti-6.5.0/
cacti-6.5.0/README
cacti-6.5.0/Ucache.cc
cacti-6.5.0/Ucache.h
cacti-6.5.0/arbiter.cc
cacti-6.5.0/arbiter.h
cacti-6.5.0/area.cc
cacti-6.5.0/area.h
cacti-6.5.0/bank.cc
cacti-6.5.0/bank.h
cacti-6.5.0/basic_circuit.cc
cacti-6.5.0/basic_circuit.h
cacti-6.5.0/cache.cfg
cacti-6.5.0/cacti.i
cacti-6.5.0/cacti.mk
cacti-6.5.0/cacti_interface.cc
cacti-6.5.0/cacti_interface.h
cacti-6.5.0/component.cc
cacti-6.5.0/component.h
cacti-6.5.0/const.h
cacti-6.5.0/contention.dat
cacti-6.5.0/crossbar.cc
cacti-6.5.0/crossbar.h
cacti-6.5.0/decoder.cc
cacti-6.5.0/decoder.h
cacti-6.5.0/dram.cfg
cacti-6.5.0/htree2.cc
cacti-6.5.0/htree2.h
cacti-6.5.0/io.cc
cacti-6.5.0/io.h
cacti-6.5.0/main.cc
cacti-6.5.0/makefile
cacti-6.5.0/mat.cc
cacti-6.5.0/mat.h
cacti-6.5.0/nuca.cc
cacti-6.5.0/nuca.h
cacti-6.5.0/parameter.cc
cacti-6.5.0/parameter.h
cacti-6.5.0/router.cc
cacti-6.5.0/router.h
cacti-6.5.0/subarray.cc
cacti-6.5.0/subarray.h
cacti-6.5.0/technology.cc
cacti-6.5.0/uca.cc
cacti-6.5.0/uca.h
cacti-6.5.0/wir

In [22]:
# compile the project
!cd ./sram/cacti && make

mkdir obj_dbg
make[1]: Entering directory `/home/hy34/bitfusion/sram/cacti'
g++ -m64 -Wno-unknown-pragmas -Wall  -ggdb -g -O0 -DNTHREADS=1  -gstabs+  -c area.cc -o obj_dbg/area.o
g++ -m64 -Wno-unknown-pragmas -Wall  -ggdb -g -O0 -DNTHREADS=1  -gstabs+  -c bank.cc -o obj_dbg/bank.o
g++ -m64 -Wno-unknown-pragmas -Wall  -ggdb -g -O0 -DNTHREADS=1  -gstabs+  -c mat.cc -o obj_dbg/mat.o
g++ -m64 -Wno-unknown-pragmas -Wall  -ggdb -g -O0 -DNTHREADS=1  -gstabs+  -c main.cc -o obj_dbg/main.o
g++ -m64 -Wno-unknown-pragmas -Wall  -ggdb -g -O0 -DNTHREADS=1  -gstabs+  -c Ucache.cc -o obj_dbg/Ucache.o
g++ -m64 -Wno-unknown-pragmas -Wall  -ggdb -g -O0 -DNTHREADS=1  -gstabs+  -c io.cc -o obj_dbg/io.o
g++ -m64 -Wno-unknown-pragmas -Wall  -ggdb -g -O0 -DNTHREADS=1  -gstabs+  -c technology.cc -o obj_dbg/technology.o
g++ -m64 -Wno-unknown-pragmas -Wall  -ggdb -g -O0 -DNTHREADS=1  -gstabs+  -c basic_circuit.cc -o obj_dbg/basic_circuit.o
basic_circuit.cc: In function ‘double shortcircuit

In [23]:
# To test the installation of cacti-7 and the python wrapper: 
!cd sram && python cacti_sweep.py

**************************************************
Eyeriss @ 65nm
area: nan mm^2
leakage power: nan mWatt
read energy per bit: nan pJ
write energy per bit: nan pJ
avg energy per bit: nan pJ
**************************************************
BitFusion @ 45nm
{'technology (u)': 0.045, 'block size (bytes)': 4, 'size (bytes)': 128}
   size (bytes)  block size (bytes)  ...  area_mm^2  technology (u)
3         128.0                 4.0  ...   0.001499           0.045

[1 rows x 11 columns]
area: 0.00149898 mm^2
size: 128 bytes
total area: 0.76747776 mm^2
total size: 65536 bytes
read energy per bit: 0.02136565625 pJ
write energy per bit: 0.02891615625 pJ
avg energy per bit: 0.02514090625 pJ
**********
area: 0.015132 mm^2
size: 2048 bytes
total area: 0.242112 mm^2
total size: 32768 bytes
read energy per bit: 0.059474375 pJ
write energy per bit: 0.1137040625 pJ
avg energy per bit: 0.08658921875 pJ
**********
area: 0.00760331 mm^2
size: 512 bytes
total area: 0.24330592 mm^2
total size: 16384 byt
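
The totals printed above are consistent with a simple relationship: each buffer is built from identical banks, so the total area is the per-bank area times the number of banks, and the average energy is the mean of the read and write energies. The sketch below re-derives the printed numbers under that assumption; the helper names are hypothetical and this is not how `cacti_sweep.py` itself is written.

```python
def bank_totals(bank_area_mm2, bank_bytes, total_bytes):
    """Return (number of banks, total area) for a buffer made of equal banks."""
    banks = total_bytes // bank_bytes
    return banks, banks * bank_area_mm2


def avg_energy(read_pj, write_pj):
    """Mean of the per-bit read and write energies."""
    return (read_pj + write_pj) / 2.0


# (per-bank area in mm^2, per-bank bytes, total bytes), copied from the output above
buffers = [
    (0.00149898, 128, 65536),
    (0.015132, 2048, 32768),
    (0.00760331, 512, 16384),
]
for area, size, total in buffers:
    banks, total_area = bank_totals(area, size, total)
    print("{} banks of {} bytes -> {:.8f} mm^2".format(banks, size, total_area))

print("avg energy: {:.11f} pJ".format(avg_energy(0.02136565625, 0.02891615625)))
```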

#### Last Step
* Replace `./src/sweep/sweep.py` with the new file (https://drive.google.com/file/d/13yARyK7nFevNF15689Swyh6LgE7M6Qff/view?usp=sharing)
* Replace `./src/simulator/stats.py` with the new file (https://drive.google.com/file/d/1Dlb6r1dfrbXYMaP4fS0Y-2_wpGPbHUH4/view?usp=sharing)

### Then you can run the simulation

In [1]:
import pandas
import configparser
import os
import numpy as np
from graph_plot.barchart import BarChart

import matplotlib

import warnings
warnings.filterwarnings('ignore')

import dnnweaver2

import src.benchmarks.benchmarks as benchmarks
from src.simulator.stats import Stats
from src.simulator.simulator import Simulator
from src.sweep.sweep import SimulatorSweep, check_pandas_or_run
from src.utils.utils import *
from src.optimizer.optimizer import optimize_for_order, get_stats_fast

In [2]:
## use a batch size of 128
batch_size = 128

results_dir = './results'
if not os.path.exists(results_dir):
    os.makedirs(results_dir)

fig_dir = './fig'
if not os.path.exists(fig_dir):
    os.makedirs(fig_dir)

## establish systolic array
from pandas import DataFrame

## ***Below is the missing file, which I reconstructed from the showcase***

In [3]:
data = {'Max Precision (bits)': [8, 8],
        'Min Precision (bits)': [2, 2],
        'M': [32, 4],
        'N': [16, 4],
        'Area (um^2)': [1, 1],
        'Dynamic Power (nW)': [1, 1],
        'Frequency': [1, 1],
        'Leakage Power (nW)': [1, 1]
        }

df = DataFrame(data, columns=list(data.keys()))
# Save the table as ./results/systolic_array_synth.csv (to_csv returns None when given a path)
export_csv = df.to_csv('./results/systolic_array_synth.csv', index=False, header=True)
print(df)

   Frequency  Area (um^2)  Leakage Power (nW)  Dynamic Power (nW)  \
0          1            1                   1                   1   
1          1            1                   1                   1   

   Max Precision (bits)   M  Min Precision (bits)   N  
0                     8  32                     2  16  
1                     8   4                     2   4  
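
How the simulator consumes this file is not shown in this tutorial, but a table of this shape is naturally queried by array dimensions and precision range. The sketch below is a hypothetical illustration using only the standard library; `lookup_array` is not part of the BitFusion codebase, and `CSV_TEXT` simply mirrors the placeholder rows created above.

```python
import csv
import io

CSV_TEXT = """Max Precision (bits),Min Precision (bits),M,N,Area (um^2),Dynamic Power (nW),Frequency,Leakage Power (nW)
8,2,32,16,1,1,1,1
8,2,4,4,1,1,1,1
"""


def lookup_array(csv_text, n, m, max_bits, min_bits):
    """Return the first row matching the array shape and precision range, or None."""
    wanted = {
        "N": str(n),
        "M": str(m),
        "Max Precision (bits)": str(max_bits),
        "Min Precision (bits)": str(min_bits),
    }
    for row in csv.DictReader(io.StringIO(csv_text)):
        # csv yields strings, so compare against the stringified query values
        if all(row[key] == value for key, value in wanted.items()):
            return row
    return None


row = lookup_array(CSV_TEXT, n=16, m=32, max_bits=8, min_bits=2)
print(row["Area (um^2)"])
```

Because every synthesis value here is the placeholder `1`, any area or power figure derived from this table should be read as a stand-in rather than a measurement.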


In [4]:
config_file = 'bf_e_conf.ini'
# config_file = 'conf.ini'

# Create simulator object
verbose = False
bf_e_sim = Simulator(config_file, verbose)
bf_e_energy_costs = bf_e_sim.get_energy_cost()
print(bf_e_sim)

energy_tuple = bf_e_energy_costs
print('')
print('*'*50)
print(energy_tuple)

Simulator object
	Max supported precision: 8
	Min supported precision: 2
	Systolic array size: 16 -inputs x 32 -outputs
	Wbuf size: 65,536 Bytes
	Ibuf size: 32,768 Bytes
	Obuf size: 16,384 Bytes
Double buffering enabled. Sizes of SRAM are halved

**************************************************
Energy costs for BitFusion
Core dynamic energy : 1000.000 pJ/cycle (for entire systolic array)
WBUF Read energy    : 0.021 pJ/bit
WBUF Write energy   : 0.029 pJ/bit
IBUF Read energy    : 0.059 pJ/bit
IBUF Write energy   : 0.114 pJ/bit
OBUF Read energy    : 0.033 pJ/bit
OBUF Write energy   : 0.073 pJ/bit
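
As a rough illustration of how these per-bit costs turn into buffer energy, the sketch below multiplies bit counts by the read and write costs printed above and sums the result. It is an assumption-laden stand-in, not the simulator's `Stats.get_energy`: it ignores DRAM and core energy, and the traffic numbers are made up.

```python
# pJ per bit, copied from the printed energy costs above
ENERGY_PJ_PER_BIT = {
    "wbuf": {"read": 0.021, "write": 0.029},
    "ibuf": {"read": 0.059, "write": 0.114},
    "obuf": {"read": 0.033, "write": 0.073},
}


def sram_energy_pj(bits_read, bits_written, costs=ENERGY_PJ_PER_BIT):
    """Sum read and write energy over all buffers, in pJ."""
    total = 0.0
    for name, cost in costs.items():
        total += bits_read.get(name, 0) * cost["read"]
        total += bits_written.get(name, 0) * cost["write"]
    return total


# Hypothetical traffic: 1,000 bits read from each buffer, 500 bits written to OBUF
example = sram_energy_pj({"wbuf": 1000, "ibuf": 1000, "obuf": 1000}, {"obuf": 500})
print("{:.1f} pJ".format(example))
```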



### ***You can define your own network (modify `./src/benchmarks/benchmarks.py`)***
Below I show several of the default networks.

In [5]:
# bench_list = benchmarks.FracTrain_benchlist
# bench_list = benchmarks.L2A_benchlist
bench_list = benchmarks.benchlist

sim_sweep_columns = ['N', 'M',
        'Max Precision (bits)', 'Min Precision (bits)',
        'Network', 'Layer',
        'Cycles', 'Memory wait cycles',
        'WBUF Read', 'WBUF Write',
        'OBUF Read', 'OBUF Write',
        'IBUF Read', 'IBUF Write',
        'DRAM Read', 'DRAM Write',
        'Bandwidth (bits/cycle)',
        'WBUF Size (bits)', 'OBUF Size (bits)', 'IBUF Size (bits)',
        'Batch size']

# bf_e_sim_sweep_csv = os.path.join(results_dir, 'bitfusion-eyeriss-sim-sweep.csv')
# if os.path.exists(bf_e_sim_sweep_csv):
#     bf_e_sim_sweep_df = pandas.read_csv(bf_e_sim_sweep_csv)
# else:
#     bf_e_sim_sweep_df = pandas.DataFrame(columns=sim_sweep_columns)
bf_e_sim_sweep_csv = os.path.join(results_dir, 'trial.csv')
if os.path.exists(bf_e_sim_sweep_csv):
    os.remove(bf_e_sim_sweep_csv)
bf_e_sim_sweep_df = pandas.DataFrame(columns=sim_sweep_columns)
print('Got BitFusion Eyeriss, Numbers')

# Run the sweep over every benchmark, then sum the numeric columns per network
bf_e_results = check_pandas_or_run(bf_e_sim, bf_e_sim_sweep_df, bf_e_sim_sweep_csv, list_bench=bench_list, batch_size=batch_size, config_file='./conf.ini')
bf_e_results = bf_e_results.groupby('Network', as_index=False).agg(np.sum)
# Save the per-network totals (to_csv returns None when given a path)
export_csv = bf_e_results.to_csv('./results/trial_stat.csv', index=False, header=True)
area_stats = bf_e_sim.get_area()

INFO:src.sweep.sweep.Simulator:Simulating Benchmark: AlexNet
INFO:src.sweep.sweep.Simulator:N x M = 16 x 32
INFO:src.sweep.sweep.Simulator:Max Precision (bits): 8
INFO:src.sweep.sweep.Simulator:Min Precision (bits): 2
INFO:src.sweep.sweep.Simulator:Batch size: 128
INFO:src.sweep.sweep.Simulator:Bandwidth (bits/cycle): 192


Got BitFusion Eyeriss, Numbers


INFO:src.sweep.sweep.Simulator:Simulating Benchmark: SVHN
INFO:src.sweep.sweep.Simulator:N x M = 16 x 32
INFO:src.sweep.sweep.Simulator:Max Precision (bits): 8
INFO:src.sweep.sweep.Simulator:Min Precision (bits): 2
INFO:src.sweep.sweep.Simulator:Batch size: 128
INFO:src.sweep.sweep.Simulator:Bandwidth (bits/cycle): 192
INFO:src.sweep.sweep.Simulator:Simulating Benchmark: CIFAR10
INFO:src.sweep.sweep.Simulator:N x M = 16 x 32
INFO:src.sweep.sweep.Simulator:Max Precision (bits): 8
INFO:src.sweep.sweep.Simulator:Min Precision (bits): 2
INFO:src.sweep.sweep.Simulator:Batch size: 128
INFO:src.sweep.sweep.Simulator:Bandwidth (bits/cycle): 192
INFO:src.sweep.sweep.Simulator:Simulating Benchmark: LeNet-5
INFO:src.sweep.sweep.Simulator:N x M = 16 x 32
INFO:src.sweep.sweep.Simulator:Max Precision (bits): 8
INFO:src.sweep.sweep.Simulator:Min Precision (bits): 2
INFO:src.sweep.sweep.Simulator:Batch size: 128
INFO:src.sweep.sweep.Simulator:Bandwidth (bits/cycle): 192
INFO:src.sweep.sweep.Simulator:
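
The `groupby('Network').agg(np.sum)` call in the cell above collapses the sweep rows into one row of totals per network. For readers less familiar with pandas, the sketch below performs the same kind of reduction with plain dictionaries; `sum_by_network` is a hypothetical helper and the layer numbers are made up purely to show the shape of the operation.

```python
from collections import defaultdict


def sum_by_network(rows, fields):
    """Sum the given numeric fields over all rows that share a 'Network' value."""
    totals = defaultdict(lambda: dict.fromkeys(fields, 0))
    for row in rows:
        for field in fields:
            totals[row["Network"]][field] += row[field]
    return dict(totals)


# Made-up per-layer numbers, only to show the shape of the reduction
layers = [
    {"Network": "AlexNet", "Layer": "conv1", "Cycles": 100, "DRAM Read": 10},
    {"Network": "AlexNet", "Layer": "conv2", "Cycles": 250, "DRAM Read": 20},
    {"Network": "LeNet-5", "Layer": "conv1", "Cycles": 40, "DRAM Read": 5},
]
per_network = sum_by_network(layers, ["Cycles", "DRAM Read"])
print(per_network["AlexNet"]["Cycles"])
```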

In [6]:
from src.simulator.stats import Stats
def df_to_stats(df):
    stats = Stats()
    stats.total_cycles = float(df['Cycles'])
    stats.mem_stall_cycles = float(df['Memory wait cycles'])
    stats.reads['act'] = float(df['IBUF Read'])
    stats.reads['out'] = float(df['OBUF Read'])
    stats.reads['wgt'] = float(df['WBUF Read'])
    stats.reads['dram'] = float(df['DRAM Read'])
    stats.writes['act'] = float(df['IBUF Write'])
    stats.writes['out'] = float(df['OBUF Write'])
    stats.writes['wgt'] = float(df['WBUF Write'])
    stats.writes['dram'] = float(df['DRAM Write'])
    return stats
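
The comparison cell that follows relies on two small pieces of arithmetic: the expression `3.5*3.5*45*45/65./65.` rescales an area by the square of the feature-size ratio, and dividing a cycle count by `500.e3` yields milliseconds if the clock runs at 500 MHz. The sketch below spells both out. The helper names are hypothetical, and the readings of `3.5*3.5` as a die size in mm^2 and of 500 MHz as the clock are inferred from the code rather than stated in it.

```python
def scale_area(area_mm2, from_nm, to_nm):
    """Scale an area between technology nodes, assuming area ~ (feature size)^2."""
    return area_mm2 * (float(to_nm) / from_nm) ** 2


def cycles_to_ms(cycles, clock_hz=500e6):
    """Convert a cycle count to milliseconds at the given clock frequency."""
    return cycles / clock_hz * 1e3


budget = scale_area(3.5 * 3.5, from_nm=65, to_nm=45)
print("area budget: {:.4f} mm^2".format(budget))
print("1e6 cycles: {:.1f} ms".format(cycles_to_ms(1e6)))
```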

In [7]:
print('BitFusion-Eyeriss comparison')
eyeriss_area = 3.5*3.5*45*45/65./65.
print('Area budget = {}'.format(eyeriss_area))


print(area_stats)
if abs(sum(area_stats)-eyeriss_area)/eyeriss_area > 0.1:
    print('Warning: BitFusion Area is outside 10% of eyeriss')
print('total_area = {}, budget = {}'.format(sum(area_stats), eyeriss_area))
bf_e_area = sum(area_stats)

baseline_data = []
for bench in bench_list:
    lookup_dict = {'Benchmark': bench}

    # eyeriss_cycles = float(lookup_pandas_dataframe(eyeriss_data_bench, lookup_dict)['time(ms)'])
    # eyeriss_time = eyeriss_cycles / 500.e3 / 16
    # eyeriss_energy = get_eyeriss_energy(lookup_pandas_dataframe(eyeriss_data_bench, lookup_dict))
    # eyeriss_power = eyeriss_energy / eyeriss_time * 1.e-9

    # eyeriss_speedup = eyeriss_time / eyeriss_time
    # eyeriss_energy_efficiency = eyeriss_energy / eyeriss_energy

    # eyeriss_ppa = eyeriss_speedup / eyeriss_area / (eyeriss_speedup / eyeriss_area)
    # eyeriss_ppw = eyeriss_speedup / eyeriss_power / (eyeriss_speedup / eyeriss_power)

    bf_e_stats = df_to_stats(bf_e_results.loc[bf_e_results['Network'] == bench])
    bf_e_cycles = bf_e_stats.total_cycles * (batch_size / 16.)
    bf_e_time = bf_e_cycles / 500.e3 #/ 16
    cc_energy, mem_energy = bf_e_stats.get_energy(bf_e_sim.get_energy_cost())
    cc_energy = cc_energy * (batch_size / 16.)
    mem_energy = mem_energy * (batch_size / 16.)
    print('cc_energy: {}, mem_energy: {}'.format(cc_energy*1.e-9, mem_energy*1.e-9))
    bf_e_energy = cc_energy + mem_energy
    bf_e_power = bf_e_energy / bf_e_time * 1.e-9

    # bf_e_speedup = eyeriss_time / bf_e_time
    # bf_e_energy_efficiency = eyeriss_energy / bf_e_energy

    # bf_e_ppa = bf_e_speedup / bf_e_area / (eyeriss_speedup / eyeriss_area)
    # bf_e_ppw = bf_e_speedup / bf_e_power / (eyeriss_speedup / eyeriss_power)

    # baseline_data.append(['Performance', bench, bf_e_speedup])
    # baseline_data.append(['Energy reduction', bench, bf_e_energy_efficiency])
    # baseline_data.append(['Performance-per-Watt', bench, bf_e_ppw])
    # baseline_data.append(['Performance-per-Area', bench, bf_e_ppa])

    print('*'*50)
    print('Benchmark: {}'.format(bench))
    # print('Eyeriss time: {} ms'.format(eyeriss_time))
    print('BitFusion time: {} ms'.format(bf_e_time))
    # print('Eyeriss power: {} mWatt'.format(eyeriss_power*1.e3*16))
    print('BitFusion power: {} mWatt'.format(bf_e_power*1.e3*16))
    # print('BitFusion energy: {} mJ'.format(bf_e_time*bf_e_power*1.e3*16 / 1.e3))
    print('BitFusion energy: {} J'.format(bf_e_energy*1.e-9))
    print('*'*50)

BitFusion-Eyeriss comparison
Area budget = 5.87130177515
(1e-06, 0.76747776, 0.242112, 0.24330592)
total_area = 1.25289668, budget = 5.87130177515
cc_energy: 0.246985984, mem_energy: 0.997503628355
**************************************************
Benchmark: AlexNet
BitFusion time: 602.663856 ms
BitFusion power: 33.0397013185 mWatt
BitFusion energy: 1.24448961236 J
**************************************************
cc_energy: 0.006242816, mem_energy: 0.0322949763313
**************************************************
Benchmark: SVHN
BitFusion time: 21.080368 ms
BitFusion power: 29.2501856372 mWatt
BitFusion energy: 0.0385377923313 J
**************************************************
cc_energy: 0.014172672, mem_energy: 0.0893990093133
**************************************************
Benchmark: CIFAR10
BitFusion time: 52.699264 ms
BitFusion power: 31.4453518936 mWatt
BitFusion energy: 0.103571681313 J
**************************************************
cc_energy: 0.001802496, mem_energy