<a href="https://colab.research.google.com/github/uwsampl/tutorial/blob/master/tutorial/notebook/05_TVM_Tutorial_TSIM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TSIM: Cycle Accurate Simulation for Custom HW in TVM

TSIM uses [Verilator](https://www.veripool.org/wiki/verilator) to integrate accelerators, including VTA, into TVM and provides flexibility in the hardware language used to implement them.
For example, one could use OpenCL, C/C++ or Chisel3 to describe a VTA design that is eventually compiled down to Verilog, the standard input language for FPGA/ASIC tools.
Additionally, Verilator supports the Direct Programming Interface (DPI), which is part of the SystemVerilog standard and provides a mechanism for calling code written in foreign programming languages.

We leveraged these Verilator features to create DPI modules that provide interfaces to hardware and software. The following figure gives a high-level overview of what TSIM can do.

<img src="https://raw.githubusercontent.com/vegaluisjose/fcrc-images/master/overview.png" width="640">

## Hardware DPI module

Normally, a hardware accelerator interface can be reduced to two main components, one for control and another for data. The control interface is driven by a host CPU, whereas the data interface is connected to either external memories (DRAM) or internal memories in the form of scratchpads or caches. Currently, we support a shared-memory model between the host and the accelerator. This means the host is in charge of passing values and addresses (pointers) to the accelerator, including those for data and code if needed.


Two hardware modules written in Verilog, `VTAHostDPI.v` and `VTAMemDPI.v`, implement these two interfaces. Accelerators implemented in Verilog can use these modules directly. We also provide Chisel3 `BlackBox` wrappers for accelerators described in Chisel3.

The following block diagram shows how to wire up an accelerator to the host and memory interfaces.

<img src="https://raw.githubusercontent.com/vegaluisjose/fcrc-images/master/hwapi.png" width="640">

## Software DPI module

The software DPI module allows users to write drivers that control the accelerator. For example, some accelerators may need to know memory addresses before issuing data or code requests to memory. The module supports this through functions that read and write registers in the accelerator, such as:
```c

// Read an accelerator register
uint32_t ReadReg(int addr);

// Write an accelerator register
void WriteReg(int addr, uint32_t value);
```

In addition to accessing registers, users can manage the hardware simulation thread with the `Launch` and `Finish` functions.

```c
// Run the hardware simulation until the accelerator finishes or max_cycles is reached
void Launch(uint64_t max_cycles);

// Finish hardware simulation
void Finish();
```
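To make the call order concrete, here is a minimal pure-Python sketch of how a driver might use these four functions. `MockDevice`, `run_driver`, the register offsets, and the one-cycle-per-element behaviour are all hypothetical stand-ins for illustration. A real driver is written against the software DPI module, and its register map depends on the accelerator.

```python
class MockDevice:
    """Hypothetical stand-in for the software DPI module (not TSIM code)."""

    # Assumed register map, for illustration only.
    REG_CYCLES, REG_CONST, REG_LENGTH, REG_IN_ADDR, REG_OUT_ADDR = range(5)

    def __init__(self, memory):
        self.memory = memory      # shared memory, modelled as a Python list
        self.regs = [0] * 5
        self.running = False

    def write_reg(self, addr, value):
        self.regs[addr] = value

    def read_reg(self, addr):
        return self.regs[addr]

    def launch(self, max_cycles):
        # Pretend each element takes one cycle; stop early at max_cycles.
        self.running = True
        n = min(self.regs[self.REG_LENGTH], max_cycles)
        src = self.regs[self.REG_IN_ADDR]
        dst = self.regs[self.REG_OUT_ADDR]
        for i in range(n):
            self.memory[dst + i] = self.memory[src + i] + self.regs[self.REG_CONST]
        self.regs[self.REG_CYCLES] = n

    def finish(self):
        self.running = False


def run_driver(dev, length, in_addr, out_addr, const, max_cycles=1000):
    """Program the registers, run the simulation, and return the cycle count."""
    dev.write_reg(dev.REG_CONST, const)
    dev.write_reg(dev.REG_LENGTH, length)
    dev.write_reg(dev.REG_IN_ADDR, in_addr)
    dev.write_reg(dev.REG_OUT_ADDR, out_addr)
    dev.launch(max_cycles)
    cycles = dev.read_reg(dev.REG_CYCLES)
    dev.finish()
    return cycles


memory = [1, 2, 3, 4, 0, 0, 0, 0]   # input at 0..3, output at 4..7
dev = MockDevice(memory)
cycles = run_driver(dev, length=4, in_addr=0, out_addr=4, const=10)
print(cycles, memory[4:])
```

The point of the sketch is the ordering: the host writes addresses and parameters into registers first, launches the simulation, and only then reads back status registers before finishing.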

# Setup

## Get TVM

In [0]:
try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    ! gsutil cp "gs://tvm-fcrc-binariesd5fce43e-8373-11e9-bfb6-0242ac1c0002/tvm.tar.gz" /tmp/tvm.tar.gz
    ! mkdir -p /tvm
    ! tar -xf /tmp/tvm.tar.gz --strip-components=4 --directory /tvm
    ! ls -la /tvm
    ! bash /tvm/package.sh
    # Add TVM to the Python path.
    import sys
    sys.path.append('/tvm/python')
    sys.path.append('/tvm/topi/python')
    sys.path.append('/tvm/nnvm/python')
    sys.path.append('/tvm/vta/python')
else:
    print("Notebook executing locally, skipping Colab setup ...")

# Vanilla accelerator

We built a vanilla accelerator to showcase how TSIM works in TVM. It is implemented in two hardware backends, Verilog and Chisel3, to demonstrate the flexibility of this infrastructure and to help users understand how to add accelerators written in Verilog or in hardware languages that generate Verilog.

The accelerator performs the operation **A = B + C**, where **A** and **B** are 1-D tensors and **C** is a constant. The following figure shows the hardware architecture.

<img src="https://raw.githubusercontent.com/vegaluisjose/fcrc-images/master/accel.png" width="320">
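As a software reference for this operation, the sketch below computes the same element-wise sum in plain Python. The function name `add_const` is hypothetical, and the 64-bit wraparound is an assumption based on the `uint64` arrays used by the test in this notebook, not a statement about the hardware.

```python
MASK64 = (1 << 64) - 1  # assumed element width: 64 bits


def add_const(b, c):
    """Return A where A[i] = (B[i] + C) mod 2**64."""
    return [(x + c) & MASK64 for x in b]


print(add_const([0, 1, 2, 63], 5))
```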

## Verilog backend

### Source files

In [0]:
%%bash
tree -C /tvm/vta/apps/tsim_example/hardware/verilog

### How to build

In [0]:
%%bash
cd /tvm/vta/apps/tsim_example/hardware/verilog
make

## Chisel3 backend

### Source files

In [4]:
%%bash
tree -C /tvm/vta/apps/tsim_example/hardware/chisel/src

/tvm/vta/apps/tsim_example/hardware/chisel/src
├── main
│   └── scala
│       └── accel
│           ├── Accel.scala
│           ├── Compute.scala
│           └── RegFile.scala
└── test
    └── scala
        └── dut
            └── TestAccel.scala

6 directories, 4 files


### How to build

In [0]:
%%bash
cd /tvm/vta/apps/tsim_example/hardware/chisel
make

## Software driver

### Source files

In [0]:
%%bash
cat /tvm/vta/apps/tsim_example/src/driver.cc

### How to build

In [0]:
%%bash
cd /tvm/vta/apps/tsim_example
make driver

## Create a test

In [0]:
import tvm
import numpy as np
import ctypes

In [0]:
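def tsim(hw_backend):
  """Return a function that runs the accelerator on the chosen backend."""
  def load_dll(dll):
    # Load the shared library with global symbol visibility.
    try:
      return [ctypes.CDLL(dll, ctypes.RTLD_GLOBAL)]
    except OSError:
      return []

  def run(a, b):
    # Pick the simulated hardware library; anything but "chisel" uses Verilog.
    if hw_backend == "chisel":
      hw_lib = '/tvm/vta/apps/tsim_example/hardware/chisel/build/libhw.so'
    else:
      hw_lib = '/tvm/vta/apps/tsim_example/hardware/verilog/build/libhw.so'
    sw_lib = '/tvm/vta/apps/tsim_example/build/libsw.so'
    # The driver library must be loaded before its global function is looked up.
    load_dll(sw_lib)
    f = tvm.get_global_func("tvm.vta.driver")
    m = tvm.module.load(hw_lib, "vta-tsim")
    f(m, a, b)
  return run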

In [0]:
def test_accel(n, hw_backend):
    ctx = tvm.cpu(0)
    rmax = 64
    # a holds random values in [0, rmax); b starts as zeros and is filled in by the run.
    a = tvm.nd.array(np.random.randint(rmax, size=n).astype("uint64"), ctx)
    b = tvm.nd.array(np.zeros(n).astype("uint64"), ctx)
    f = tsim(hw_backend)
    f(a, b)
    # Print each element of both tensors side by side.
    for i, (x, y) in enumerate(zip(a.asnumpy(), b.asnumpy())):
        print("i:{} a:{} b:{}".format(i, x, y))
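`test_accel` only prints the two tensors. One way to turn it into a pass/fail check, without assuming the value of the constant the driver programs, is to verify that every output differs from its input by the same amount. The helper name `infer_constant` is hypothetical and is not part of TVM or the TSIM example.

```python
def infer_constant(inputs, outputs):
    """Return the single constant c with outputs[i] == inputs[i] + c.

    Raises ValueError if the lengths differ or no single constant fits.
    """
    if len(inputs) != len(outputs):
        raise ValueError("input and output lengths differ")
    # Convert to Python ints so unsigned array elements cannot wrap on subtraction.
    diffs = {int(y) - int(x) for x, y in zip(inputs, outputs)}
    if len(diffs) != 1:
        raise ValueError("outputs are not inputs plus one constant: %s" % sorted(diffs))
    return diffs.pop()


print(infer_constant([3, 7, 1], [8, 12, 6]))
```

With the arrays from `test_accel`, this could be called as `infer_constant(a.asnumpy(), b.asnumpy())` after the run completes.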

## Run Accelerator in Verilog

In [0]:
test_accel(10, "verilog")

## Run Accelerator in Chisel

In [0]:
test_accel(20, "chisel")