<br />
<div align="center">
  <a href="https://deepwok.github.io/">
    <img src="../imgs/deepwok.png" alt="Logo" width="160" height="160">
  </a>

  <h1 align="center">Lab 4 for Advanced Deep Learning Systems (ADLS) - Hardware Stream</h1>

  <p align="center">
    ELEC70109/EE9-AML3-10/EE9-AO25
    <br />
    Written by
    <a href="https://aaron-zhao123.github.io/">Aaron Zhao, Pedro Gimenes</a>
  </p>
</div>

# General introduction

In this lab, you will learn how to emit SystemVerilog code for a neural network that has been transformed and optimized by MASE. Then, you'll design hardware for a new PyTorch layer and simulate the design with your new module included.

# The Hardware Emit pass

The `emit_verilog` transform pass generates a top-level RTL file and testbench file according to the `MaseGraph`, which includes a hardware implementation of each layer in the network. This top-level file instantiates modules from the `components` library in MASE and/or modules generated using [HLS](https://en.wikipedia.org/wiki/High-level_synthesis), when internal components are not available. The hardware can then be simulated using [Verilator](https://www.veripool.org/verilator/), or deployed on an FPGA.

First, add Machop to your system PATH (if you haven't already done so) and import the required libraries.

In [17]:
import os
import sys

import toml
import torch
import torch.nn as nn

from chop.ir.graph.mase_graph import MaseGraph

from chop.passes.graph.analysis import (
    init_metadata_analysis_pass,
    add_common_metadata_analysis_pass,
    add_hardware_metadata_analysis_pass,
    report_node_type_analysis_pass,
)

from chop.passes.graph.transforms import (
    emit_verilog_top_transform_pass,
    emit_internal_rtl_transform_pass,
    emit_bram_transform_pass,
    emit_cocotb_transform_pass,
    quantize_transform_pass,
)

from chop.tools.logger import set_logging_verbosity

# Fix the random seed so the generated weights are reproducible
torch.manual_seed(0)

set_logging_verbosity("debug")

# Prepend the Homebrew bin directory so the notebook can find Verilator
# (macOS-specific; remove or adjust this line on other systems)
os.environ["PATH"] = "/opt/homebrew/bin:" + os.environ["PATH"]
!verilator

INFO     Set logging level to debug


Usage:
        verilator --help
        verilator --version
        verilator --binary -j 0 [options] [source_files.v]... [opt_c_files.cpp/c/cc/a/o/so]
        verilator --cc [options] [source_files.v]... [opt_c_files.cpp/c/cc/a/o/so]
        verilator --sc [options] [source_files.v]... [opt_c_files.cpp/c/cc/a/o/so]
        verilator --lint-only -Wall [source_files.v]...



Now, define the neural network. We're using a toy fully connected model: a single 8x8 linear layer standing in for an MNIST digit classifier.

In [18]:
class MLP(torch.nn.Module):
    """
    Toy FC model for digit recognition on MNIST
    """

    def __init__(self) -> None:
        super().__init__()

        self.fc1 = nn.Linear(8, 8)

    def forward(self, x):
        # x = torch.flatten(x, start_dim=1, end_dim=-1)
        # x = torch.nn.functional.relu(self.fc1(x))
        x = self.fc1(x)
        return x

Now, we'll generate a MaseGraph and add metadata. 

Before running `emit_verilog`, we'll quantize the model to a low-precision format. The configuration below uses the `mxint` type with 12-bit values and 4-bit exponents. Refer back to [lab 3](https://deepwok.github.io/mase/modules/labs_2023/lab3.html) if you've forgotten how quantization works. Check that the data type for each node is correct after quantization.
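
To make the `mxint` settings in the next cell more concrete, here is a minimal sketch of block quantization with a shared exponent. It assumes that `mxint` stores each block of values as signed integer mantissas plus one shared exponent, and that the `*_width` and `*_exponent_width` fields are the corresponding bit widths. `quantize_block` and `dequantize_block` are hypothetical helpers written for illustration; they are not MASE's quantizer, whose rounding and exponent conventions may differ.

```python
import math


def quantize_block(values, width=12, exponent_width=4):
    """Quantize one block of floats to integer mantissas sharing one exponent."""
    max_abs = max(abs(v) for v in values)
    # Shared exponent: the smallest power of two strictly above the largest magnitude
    exponent = math.floor(math.log2(max_abs)) + 1 if max_abs > 0 else 0
    # Clamp the exponent to the signed range representable in `exponent_width` bits
    exp_min = -(2 ** (exponent_width - 1))
    exp_max = 2 ** (exponent_width - 1) - 1
    exponent = min(max(exponent, exp_min), exp_max)
    # One mantissa step is worth 2**(exponent - (width - 1))
    scale = 2.0 ** (exponent - (width - 1))
    m_min = -(2 ** (width - 1))
    m_max = 2 ** (width - 1) - 1
    mantissas = [min(max(round(v / scale), m_min), m_max) for v in values]
    return mantissas, exponent


def dequantize_block(mantissas, exponent, width=12):
    """Recover approximate float values from mantissas and the shared exponent."""
    scale = 2.0 ** (exponent - (width - 1))
    return [m * scale for m in mantissas]


mantissas, exponent = quantize_block([0.5, -0.25])
print(mantissas, exponent)  # [1024, -512] 0
```

Sharing one exponent per block means only the mantissas need full-width storage, which is why the configuration specifies a block size alongside each width.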

In [19]:
mlp = MLP()
mg = MaseGraph(model=mlp)

# Provide a dummy input so the graph can be traced
batch_size = 4
x = torch.randn((batch_size, 8))
dummy_in = {"x": x}
mlp.forward(x)

mg, _ = init_metadata_analysis_pass(mg, None)
mg, _ = add_common_metadata_analysis_pass(
    mg, {"dummy_in": dummy_in, "add_value": False}
)

quan_args = {
    "by": "type",
    "default": {
        "config": {
            "name": "mxint",
            # data
            "data_in_width": 12,
            "data_in_exponent_width": 4,
            "data_in_block_size": [1, 2],
            # weight
            "weight_width": 12,
            "weight_exponent_width": 4,
            "weight_block_size": [1, 2],
            # bias
            "bias_width": 12,
            "bias_exponent_width": 4,
            "bias_block_size": [2, 2],
        }
    },
}

mg, _ = quantize_transform_pass(mg, quan_args)

_ = report_node_type_analysis_pass(mg)

for node in mg.fx_graph.nodes:
    # if not node.meta['mase']['hardware']['is_implicit']:
    print(node, node.meta['mase']['common'])

DEBUG    graph():
    %x : [num_users=1] = placeholder[target=x]
    %fc1 : [num_users=1] = call_module[target=fc1](args = (%x,), kwargs = {})
    return fc1
INFO     Inspecting graph [add_common_node_type_analysis_pass]
INFO
Node name    Fx Node op    Mase type            Mase op      Value type
-----------  ------------  -------------------  -----------  ------------
x            placeholder   placeholder          placeholder  NA
fc1          call_module   module_related_func  linear       mxint
output       output        output               output       NA


x {'mase_type': 'placeholder', 'mase_op': 'placeholder', 'args': {}, 'results': OrderedDict([('data_out_0', {'type': 'float', 'precision': [32], 'shape': [4, 8], 'torch_dtype': torch.float32})])}
fc1 {'mase_type': 'module_related_func', 'mase_op': 'linear', 'args': OrderedDict([('data_in_0', {'shape': [4, 8], 'torch_dtype': torch.float32, 'type': 'mxint', 'precision': [12, 4]}), ('weight', {'type': 'mxint', 'precision': [12, 4], 'shape': [8, 8], 'from': None}), ('bias', {'type': 'mxint', 'precision': [12, 4], 'shape': [1, 8], 'from': None})]), 'results': OrderedDict([('data_out_0', {'type': 'mxint', 'precision': [12, 4], 'shape': [4, 8], 'torch_dtype': torch.float32})])}
output {'mase_type': 'output', 'mase_op': 'output', 'args': {}, 'results': OrderedDict([('data_out_0', {'type': 'float', 'precision': [32], 'shape': [4, 8], 'torch_dtype': torch.float32})])}


At this point, it's important to run the `add_hardware_metadata` analysis pass. This adds all the required metadata which is later used by the `emit_verilog` pass, including:

1. The node's toolchain, which defines whether we use internal Verilog modules from the `components` library or the HLS flow.
2. The Verilog parameters associated with each node.

> **_TASK:_** Read [this page](https://deepwok.github.io/mase/modules/chop/analysis/add_metadata.html#add-hardware-metadata-analysis-pass) for more information on the hardware metadata pass.
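
As a rough illustration of how this metadata can be consumed, the sketch below groups non-implicit nodes by toolchain and collects the RTL files they depend on. It works on plain dictionaries shaped like the `node.meta['mase']['hardware']` entries printed in this lab; `summarize_hardware` is a hypothetical helper, and the example entries are abbreviated by hand rather than produced by MASE.

```python
def summarize_hardware(hardware_meta_by_node):
    """Group non-implicit nodes by toolchain and list their dependence files."""
    by_toolchain = {}
    files = []
    for name, hw in hardware_meta_by_node.items():
        # Implicit nodes (e.g. placeholders) have no hardware module of their own
        if hw.get("is_implicit", False):
            continue
        by_toolchain.setdefault(hw["toolchain"], []).append(name)
        for path in hw.get("dependence_files", []):
            if path not in files:  # keep first occurrence, preserve order
                files.append(path)
    return by_toolchain, files


# Hand-written, abbreviated example modelled on the fc1 entry in this lab
example = {
    "x": {"is_implicit": True},
    "fc1": {
        "is_implicit": False,
        "toolchain": "INTERNAL_RTL",
        "module": "mxint_linear",
        "dependence_files": [
            "linear_layers/mxint_operators/rtl/mxint_linear.sv",
            "memory/rtl/input_buffer.sv",
        ],
    },
    "output": {"is_implicit": True},
}

by_toolchain, files = summarize_hardware(example)
print(by_toolchain)  # {'INTERNAL_RTL': ['fc1']}
```

With a real graph, the same dictionaries could be gathered with a comprehension over `mg.fx_graph.nodes`, as the next cell does when printing.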

In [20]:
import json
from chop.passes.graph.analysis.report import report_node_shape_analysis_pass
mg, _ = add_hardware_metadata_analysis_pass(mg)

report_node_shape_analysis_pass(mg, {})

for node in mg.fx_graph.nodes:
    if not node.meta['mase']['hardware']['is_implicit']:
        print(node, node.meta['mase']['common'], json.dumps(node.meta['mase']['hardware'], indent=2))

INFO     Inspecting graph [add_common_node_shape_analysis_pass]
INFO     x:
in:
out:
data_out_0 = [4, 8]

fc1:
in:
data_in_0 = [4, 8]
weight = {'type': 'mxint', 'precision': [12, 4], 'shape': [8, 8], 'from': None}
bias = {'type': 'mxint', 'precision': [12, 4], 'shape': [1, 8], 'from': None}
out:
data_out_0 = [4, 8]

output:
in:
out:
data_out_0 = [4, 8]



x
fc1
output
fc1 {'mase_type': 'module_related_func', 'mase_op': 'linear', 'args': OrderedDict([('data_in_0', {'shape': [4, 8], 'torch_dtype': torch.float32, 'type': 'mxint', 'precision': [12, 4]}), ('weight', {'type': 'mxint', 'precision': [12, 4], 'shape': [8, 8], 'from': None}), ('bias', {'type': 'mxint', 'precision': [12, 4], 'shape': [1, 8], 'from': None})]), 'results': OrderedDict([('data_out_0', {'type': 'mxint', 'precision': [12, 4], 'shape': [4, 8], 'torch_dtype': torch.float32})])} {
  "is_implicit": false,
  "device_id": -1,
  "interface": {
    "weight": {
      "storage": "BRAM",
      "transpose": false
    },
    "bias": {
      "storage": "BRAM",
      "transpose": false
    }
  },
  "toolchain": "INTERNAL_RTL",
  "module": "mxint_linear",
  "dependence_files": [
    "linear_layers/mxint_operators/rtl/mxint_linear.sv",
    "linear_layers/mxint_operators/rtl/mxint_circular.sv",
    "memory/rtl/input_buffer.sv",
    "linear_layers/mxint_operators/rtl/mxint_dot_product.sv"

Finally, run the emit verilog pass to generate the SystemVerilog files.

In [21]:
mg, _ = emit_verilog_top_transform_pass(mg)
mg, _ = emit_internal_rtl_transform_pass(mg)

INFO     Emitting Verilog...
INFO     Emitting internal components...


The generated files should now be found under `top/hardware` inside the MASE output directory (`~/.mase/top/hardware` in the logs shown in this lab).

> **_TASK:_** Read through `top/hardware/rtl/top.sv` and make sure you understand how our MLP model maps to this hardware design. 

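You will notice the following instantiated modules:

* The linear layer module, which implements each Linear layer in the model. With the `mxint` configuration used above, the hardware metadata selects `mxint_linear` (`linear_layers/mxint_operators/rtl/mxint_linear.sv`); a fixed-point configuration would instead use `fixed_linear`, found under `components/linear/fixed_linear.sv`.
* `fc<layer number>_weight/bias_source`: these are [BRAM](https://nandland.com/lesson-15-what-is-a-block-ram-bram/) memories which drive the weights and biases into the linear layers for computation.
* `fixed_relu`: found under `components/activations/fixed_relu.sv`, implements the ReLU activation. It only appears when the model contains a ReLU, which is commented out in the `MLP` defined above.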

We can't run a simulation on the model yet, because the memory components haven't been generated. To generate them, run the `emit_bram` transform pass as follows. It produces the memory initialization files and the SystemVerilog modules that drive weights and biases into the linear layers. Finally, the `emit_cocotb` transform pass generates the testbench files.
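
To give a feel for what a memory initialization file might contain, here is a minimal sketch that encodes signed integers as fixed-width two's complement hex words, one per line. The one-word-per-line hex layout is an assumption made for illustration; the `.dat` files written by the BRAM pass may pack several values per word or use a different layout. `to_twos_complement_hex` and `render_dat` are hypothetical helpers.

```python
def to_twos_complement_hex(value, width):
    """Encode a signed integer as a `width`-bit two's complement hex string."""
    low, high = -(1 << (width - 1)), (1 << (width - 1)) - 1
    if not low <= value <= high:
        raise ValueError(f"{value} does not fit in {width} signed bits")
    digits = (width + 3) // 4  # hex digits needed to cover `width` bits
    return format(value & ((1 << width) - 1), f"0{digits}x")


def render_dat(values, width):
    """Render one hex word per line (assumed layout, for illustration only)."""
    return "".join(to_twos_complement_hex(v, width) + "\n" for v in values)


# Two 12-bit mantissas similar to the ones printed by the pass in this lab
print(render_dat([-11, 777], 12), end="")
# ff5
# 309
```

A file in this style can be loaded into a SystemVerilog memory with `$readmemh`, which is a common way to initialize ROMs in simulation.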


In [22]:
mg, _ = emit_bram_transform_pass(mg)

INFO     Emitting BRAM...
DEBUG    Emitting DAT file for node: fc1, parameter: weight
DEBUG    Init data weight successfully written into /home/splogdes/.mase/top/hardware/rtl/fc1_weight_rom_block.dat
DEBUG    Init data weight successfully written into /home/splogdes/.mase/top/hardware/rtl/fc1_weight_rom_exp.dat
DEBUG    Emitting DAT file for node: fc1, parameter: bias
DEBUG    Init data bias successfully written into /home/splogdes/.mase/top/hardware/rtl/fc1_bias_rom_block.dat
DEBUG    Init data bias successfully written into /home/splogdes/.mase/top/hardware/rtl/fc1_bias_rom_exp.dat


tensor([[  -11.,  -558.,  -939.,  -845.],
        [  777.,   388.,  -667.,  1245.],
        [-1192.,   -29., -1012.,   646.],
        [-1066.,  1148., -1356.,   702.],
        [ -129., -1383.,    76., -1046.],
        [  383.,  -959.,  -742.,  -747.],
        [ -438.,  -597.,   245.,   914.],
        [ -285.,    54., -1352.,   849.],
        [  573.,   526.,  -642.,   575.],
        [  869.,  1203.,   -52.,   196.],
        [ -982.,  -298.,   926.,   971.],
        [ -631.,  1084.,  1440.,  -853.],
        [ -233.,  -912.,   270.,   655.],
        [  153.,  -367., -1123.,   582.],
        [ 1311.,  -564., -1004.,  -858.],
        [-1343.,  1251.,  -748.,   437.]])
tensor([[5., 5., 5., 5.]])
tensor([[ 1590.,  -366.,   111.,   671.],
        [  898.,  1391., -1116.,  -531.]])
tensor([[4.],
        [5.]])


In [23]:
mg, _ = emit_cocotb_transform_pass(mg)


INFO     Emitting testbench...


> **_TASK:_** Now, you're ready to launch a simulation by calling the simulate action as follows.

In [24]:
from chop.actions import simulate

simulate(skip_build=False, skip_test=False)

INFO: Running command perl /usr/bin/verilator -cc --exe -Mdir /home/splogdes/Documents/UNI/ADL/mase/docs/labs/sim_build -DCOCOTB_SIM=1 --top-module top --vpi --public-flat-rw --prefix Vtop -o top -LDFLAGS '-Wl,-rpath,/home/splogdes/Documents/UNI/ADL/venv/lib/python3.11/site-packages/cocotb/libs -L/home/splogdes/Documents/UNI/ADL/venv/lib/python3.11/site-packages/cocotb/libs -lcocotbvpi_verilator' -Wno-fatal -Wno-lint -Wno-style --trace-fst --trace-structs --trace-depth 3 -I/home/splogdes/.mase/top/hardware/rtl -I/home/splogdes/Documents/UNI/ADL/mase/src/mase_components/vivado/rtl -I/home/splogdes/Documents/UNI/ADL/mase/src/mase_components/normalization_layers/rtl -I/home/splogdes/Documents/UNI/ADL/mase/src/mase_components/cast/rtl -I/home/splogdes/Documents/UNI/ADL/mase/src/mase_components/activation_layers/rtl -I/home/splogdes/Documents/UNI/ADL/mase/src/mase_components/hls/rtl -I/home/splogdes/Documents/UNI/ADL/mase/src/mase_components/systolic_arrays/rtl -I/home/splogdes/Documents/UN

%Error: /home/splogdes/.mase/top/hardware/rtl/log2_max_abs.sv:36:3: syntax error, unexpected ')', expecting IDENTIFIER-for-type
   36 |   ) max_bas_i (
      |   ^
%Error: /home/splogdes/.mase/top/hardware/rtl/log2_max_abs.sv:48:3: syntax error, unexpected ')', expecting IDENTIFIER-for-type
   48 |   ) log2_i (
      |   ^
%Error: /home/splogdes/.mase/top/hardware/rtl/mxint_cast.sv:49:3: syntax error, unexpected ')', expecting IDENTIFIER-for-type
   49 |   ) max_bas_i (
      |   ^
%Error: Exiting due to 3 error(s)


SystemExit: Process 'perl' terminated with error 1

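The `simulate` action creates a `dump.vcd` file within the `sim_build` directory, which contains the waveform trace of the simulation. Since the Verilator command above passes `--trace-fst`, the trace may be written in FST format instead. Either format can be opened with a viewer like GTKWave. Note that the run recorded above stops at the Verilator build step with syntax errors in `log2_max_abs.sv` and `mxint_cast.sv`, so a trace is only produced once the design compiles.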

> **TASK**: Follow the instructions [here](https://gtkwave.sourceforge.net/) to install GTKWave on your platform, then open the generated trace file to inspect the signals in the simulation.
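
If you are unsure where the trace ended up, a short script can list candidate waveform files. This is a minimal sketch: `find_traces` is a hypothetical helper, and the `.vcd`/`.fst` extensions are assumptions about what the simulator writes.

```python
from pathlib import Path


def find_traces(sim_dir):
    """Return waveform files (.vcd or .fst) found under sim_dir, sorted by path."""
    root = Path(sim_dir)
    if not root.is_dir():
        return []  # nothing has been built yet
    return sorted(p for p in root.rglob("*") if p.suffix in {".vcd", ".fst"})


for trace in find_traces("sim_build"):
    print(trace, trace.stat().st_size, "bytes")
```

An empty listing usually means the build step failed before any simulation ran.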

# Main Task

PyTorch provides a number of layers for defining neural network models. At the moment, `emit_verilog` supports generating Verilog for models including Linear layers and the ReLU activation.

> **_MAIN TASK:_** choose another layer type from the [PyTorch list](https://pytorch.org/docs/stable/nn.html#non-linear-activations-weighted-sum-nonlinearity) and write a SystemVerilog file to implement that layer in hardware. Then, change the generated `top.sv` file to inject that layer within the design. For example, you may replace the ReLU activations with [Leaky ReLU](https://pytorch.org/docs/stable/generated/torch.nn.LeakyReLU.html#torch.nn.LeakyReLU). Re-run the simulation and observe the effect on latency and accuracy.
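
Before writing the RTL, it can help to have a software reference model to compare simulation results against. The sketch below is one possible integer Leaky ReLU, assuming the negative slope is restricted to a power of two so that the hardware only needs an arithmetic right shift (`>>>` on a signed value in SystemVerilog). The function names and the choice of slope are illustrative assumptions, not part of MASE.

```python
def leaky_relu_float(x, negative_slope=0.125):
    """Floating-point reference: pass positives, scale negatives."""
    return x if x >= 0 else negative_slope * x


def leaky_relu_fixed(x, shift=3):
    """Integer Leaky ReLU with negative slope 2**-shift.

    Python's >> on a negative int is an arithmetic shift, so the result
    rounds toward negative infinity (e.g. -1 >> 3 == -1, not 0).
    """
    return x if x >= 0 else x >> shift


inputs = [-16, -9, -1, 0, 5]
fixed = [leaky_relu_fixed(v) for v in inputs]
reference = [leaky_relu_float(v) for v in inputs]
# Largest deviation of the shift-based version from the float reference
max_error = max(abs(f - r) for f, r in zip(fixed, reference))
print(fixed)      # [-2, -2, -1, 0, 5]
print(max_error)  # 0.875
```

The rounding behaviour on small negative inputs is one source of the accuracy difference you are asked to observe, so it is worth deciding explicitly how your hardware should round.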