<a href="https://colab.research.google.com/github/uwsampl/tutorial/blob/master/notebook/05_TVM_Tutorial_TSIM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TSIM: Cycle Accurate Simulation for Custom HW in TVM

TSIM uses [Verilator](https://www.veripool.org/wiki/verilator) to integrate accelerators, including VTA, into TVM and provides flexibility in the hardware language used to implement them.
For example, one could describe a VTA design in OpenCL, C/C++, or Chisel3 and eventually compile it down to Verilog, the standard input language for FPGA/ASIC tools.
Additionally, Verilator supports the Direct Programming Interface (DPI), which is part of the Verilog standard and provides a mechanism to support foreign programming languages.

We leveraged these Verilator features to create DPI modules that provide interfaces to hardware and software. The following figure shows, at a high level, what TSIM can do.

<img src="https://raw.githubusercontent.com/vegaluisjose/fcrc-images/master/overview.png" width="640">

## Hardware DPI module

Normally, a hardware accelerator interface can be reduced to two main components: one for control and another for data. The control interface is driven by a host CPU, whereas the data interface is connected to either external memories (DRAM) or internal memories in the form of scratchpads or caches. Currently, we support a shared-memory model between the host and the accelerator. This implies that the host is in charge of passing values and addresses (pointers), including data and code if needed, to the accelerator.


Two hardware modules written in Verilog, `VTAHostDPI.v` and `VTAMemDPI.v`, implement these two interfaces. Accelerators implemented in Verilog can use these modules directly. We also provide Chisel3 `BlackBox` wrappers for accelerators described in Chisel3.

The following block diagram shows how to wire up an accelerator to the host and memory interfaces.

<img src="https://raw.githubusercontent.com/vegaluisjose/fcrc-images/master/hwapi.png" width="640">

## Software DPI module

The software DPI module allows users to write drivers that control the accelerator. For example, some accelerators may need to know memory addresses before issuing data or code requests to memory. The module supports this through functions that read and write accelerator registers, such as:
```c

// Read an accelerator register
uint32_t ReadReg(int addr);

// Write an accelerator register
void WriteReg(int addr, uint32_t value);
```
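
Because `WriteReg` takes a 32-bit value while host pointers are usually 64 bits wide, a driver presumably has to pass each address in two halves; the `get_half_addr(void *p, bool upper)` helper in `driver.cc` points the same way. The sketch below illustrates that idea in plain Python. It is not the TSIM API: `MockRegs`, `split_addr`, and the register offsets `ADDR_LO` and `ADDR_HI` are hypothetical names chosen for illustration.

```python
class MockRegs:
    """Hypothetical stand-in for the accelerator register file.

    The real ReadReg/WriteReg reach the simulated hardware through DPI;
    this class only stores values in a dictionary.
    """

    def __init__(self):
        self._regs = {}

    def write_reg(self, addr, value):
        # Registers are 32 bits wide, matching the uint32_t value argument.
        self._regs[addr] = value & 0xFFFFFFFF

    def read_reg(self, addr):
        return self._regs.get(addr, 0)


def split_addr(ptr):
    """Return the (lower, upper) 32-bit halves of a 64-bit address."""
    return ptr & 0xFFFFFFFF, (ptr >> 32) & 0xFFFFFFFF


regs = MockRegs()
ADDR_LO, ADDR_HI = 0x10, 0x14  # hypothetical register offsets

lo, hi = split_addr(0x00007F1234567890)
regs.write_reg(ADDR_LO, lo)
regs.write_reg(ADDR_HI, hi)

# Reassemble the address the way the accelerator side would see it.
rebuilt = (regs.read_reg(ADDR_HI) << 32) | regs.read_reg(ADDR_LO)
print(hex(rebuilt))
```

Splitting on the host side keeps the register interface narrow, at the cost of two writes per pointer.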

In addition to accessing registers, users can manage the hardware simulation thread with launch and finish functions.

```c
// Launch hardware simulation until the accelerator finishes or max_cycles is reached
void Launch(uint64_t max_cycles);

// Finish hardware simulation
void Finish();
```
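
The stopping rule described for `Launch` (run until the accelerator finishes or the cycle budget is exhausted) can be sketched with a toy model. Everything here is an assumption made for illustration: `MockAccel` is a hypothetical device that finishes after a fixed number of cycles, and this `launch` returns the cycle count and a done flag, whereas the real `Launch` returns `void`.

```python
class MockAccel:
    """Hypothetical accelerator that finishes after a fixed number of cycles."""

    def __init__(self, cycles_needed):
        self.cycles_needed = cycles_needed
        self.cycle = 0

    def step(self):
        # Advance the model by one clock cycle.
        self.cycle += 1

    @property
    def done(self):
        return self.cycle >= self.cycles_needed


def launch(accel, max_cycles):
    """Step the model until it reports done or the cycle budget runs out."""
    while not accel.done and accel.cycle < max_cycles:
        accel.step()
    return accel.cycle, accel.done


print(launch(MockAccel(24), max_cycles=1000))  # enough budget to finish
print(launch(MockAccel(24), max_cycles=10))    # budget runs out first
```

A cycle limit of this kind keeps a hung design from blocking the host indefinitely.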

# Setup

## Get TVM

In [1]:
try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    ! gsutil cp "gs://tvm-fcrc-binariesd5fce43e-8373-11e9-bfb6-0242ac1c0002/tvm.tar.gz" /tmp/tvm.tar.gz
    ! mkdir -p /tvm
    ! tar -xf /tmp/tvm.tar.gz --strip-components=4 --directory /tvm
    ! ls -la /tvm
    ! bash /tvm/package.sh
    # Add TVM to the Python path.
    import sys
    sys.path.append('/tvm/python')
    sys.path.append('/tvm/topi/python')
    sys.path.append('/tvm/nnvm/python')
    sys.path.append('/tvm/vta/python')
else:
    print("Notebook executing locally, skipping Colab setup ...")

Copying gs://tvm-fcrc-binariesd5fce43e-8373-11e9-bfb6-0242ac1c0002/tvm.tar.gz...
- [1 files][115.9 MiB/115.9 MiB]                                                
Operation completed over 1 objects/115.9 MiB.                                    
total 164
drwxr-xr-x 21 root root  4096 Jun 15 00:52 .
drwxr-xr-x  1 root root  4096 Jun 15 00:52 ..
drwx------  8 root root  4096 May 31 08:14 3rdparty
drwx------ 12 root root  4096 Jun 14 21:19 apps
drwx------  3 root root  4096 Jun 15 00:20 build
drwx------  4 root root  4096 Jun 14 21:19 cmake
-rw-------  1 root root 10778 Jun 14 21:19 CMakeLists.txt
drwx------  6 root root  4096 Jun 14 21:19 conda
-rw-------  1 root root  5736 Jun 14 21:19 CONTRIBUTORS.md
drwx------  3 root root  4096 Jun 14 21:19 docker
drwx------ 11 root root  4096 Jun 14 21:19 docs
drwx------  4 root root  4096 Jun 14 21:19 golang
drwx------  3 root root  4096 May 31 08:14 include
-rw-------  1 root root 10542 Jun 14 21:19 Jenkinsfile
drwx------  6 root root  4096 Jun 14 

# Vanilla accelerator

We built a vanilla accelerator to showcase how TSIM works in TVM. The vanilla accelerator is implemented in two hardware backends, Verilog and Chisel3, to demonstrate the flexibility of this infrastructure and to help users understand how to add accelerators written in Verilog or in "hardware languages" that can generate Verilog.

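The accelerator performs the operation **A = B + C**, where **A** and **B** are 1-D tensors and **C** is a scalar constant. The Python test later in this notebook names its arrays differently: the input is `a` and the result is written to `b`, so its output shows `b = a + c`. The following figure shows the hardware architecture.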

<img src="https://raw.githubusercontent.com/vegaluisjose/fcrc-images/master/accel.png" width="320">
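
As a software reference for this operation, the following sketch adds a constant to every element of a list. `vanilla_accel_ref` and `MASK64` are illustrative names, and wrapping results at 64 bits is an assumption based on the `uint64` arrays used by the test later in this notebook, not a documented property of the hardware.

```python
MASK64 = (1 << 64) - 1  # assumption: elements are 64 bits wide


def vanilla_accel_ref(b, c):
    """Return A where A[i] = B[i] + C, wrapped to 64 bits (assumed width)."""
    return [(x + c) & MASK64 for x in b]


result = vanilla_accel_ref([51, 57, 53], 2)
print(result)
```

A model like this gives the simulated hardware something to be compared against element by element.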

## Verilog backend

### Source files

In [2]:
%%bash
tree -C /tvm/vta/apps/tsim_example/hardware/verilog

/tvm/vta/apps/tsim_example/hardware/verilog
├── Makefile
└── src
    ├── Accel.v
    ├── Compute.v
    ├── RegFile.v
    └── TestAccel.v

1 directory, 5 files


### How to build

In [3]:
%%bash
cd /tvm/vta/apps/tsim_example/hardware/verilog
make

mkdir -p /tvm/vta/apps/tsim_example/hardware/verilog/build
verilator --cc +define+RANDOMIZE_GARBAGE_ASSIGN +define+RANDOMIZE_REG_INIT +define+RANDOMIZE_MEM_INIT --x-assign unique --output-split 20000 --output-split-cfuncs 20000 --top-module TestAccel -Mdir /tvm/vta/apps/tsim_example/hardware/verilog/build /tvm/vta/apps/tsim_example/hardware/verilog/src/Accel.v /tvm/vta/apps/tsim_example/hardware/verilog/src/RegFile.v /tvm/vta/apps/tsim_example/hardware/verilog/src/TestAccel.v /tvm/vta/apps/tsim_example/hardware/verilog/src/Compute.v /tvm/vta/hardware/chisel/src/main/resources/verilog/VTAMemDPI.v /tvm/vta/hardware/chisel/src/main/resources/verilog/VTAHostDPI.v
g++ -O2 -Wall -fPIC -shared -fvisibility=hidden -std=c++11 -DVL_TSIM_NAME=VTestAccel -DVL_PRINTF=printf -DVL_USER_FINISH -DVM_COVERAGE=0 -DVM_SC=0 -Wno-sign-compare -include VTestAccel.h -I/tvm/vta/apps/tsim_example/hardware/verilog/build -I/usr/share/verilator/include -I/usr/share/verilator/include/vltstd -I/tvm/vta/include -I/tv

/tvm/vta/apps/tsim_example/hardware/verilog/build/VTestAccel.cpp: In static member function ‘static void VTestAccel::_sequent__TOP__1(VTestAccel__Syms*)’:
  vlTOPp->TestAccel__DOT__accel__DOT__rf__DOT__rf[6U]
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      = __Vdlyvval__TestAccel__DOT__accel__DOT__rf__DOT__rf__v15;
      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  vlTOPp->TestAccel__DOT__accel__DOT__rf__DOT__rf[5U]
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      = __Vdlyvval__TestAccel__DOT__accel__DOT__rf__DOT__rf__v13;
      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  vlTOPp->TestAccel__DOT__accel__DOT__rf__DOT__rf[4U]
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      = __Vdlyvval__TestAccel__DOT__accel__DOT__rf__DOT__rf__v11;
      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/tvm/vta/apps/tsim_example/hardware/verilog/build/VTestAccel__Slow.cpp: In constructor ‘VTestAccel::VTestAccel(const char*)’:
  

## Chisel3 backend

### Source files

In [4]:
%%bash
tree -C /tvm/vta/apps/tsim_example/hardware/chisel/src

/tvm/vta/apps/tsim_example/hardware/chisel/src
├── main
│   └── scala
│       └── accel
│           ├── Accel.scala
│           ├── Compute.scala
│           └── RegFile.scala
└── test
    └── scala
        └── dut
            └── TestAccel.scala

6 directories, 4 files


### How to build

In [5]:
%%bash
cd /tvm/vta/apps/tsim_example/hardware/chisel
make

cd /tvm/vta/hardware/chisel && sbt publishLocal
Copying runtime jar.
downloading https://repo1.maven.org/maven2/org/scala-sbt/sbt/1.1.1/sbt-1.1.1.jar ...
	[SUCCESSFUL ] org.scala-sbt#sbt;1.1.1!sbt.jar (97ms)
downloading https://repo1.maven.org/maven2/org/scala-lang/scala-library/2.12.4/scala-library-2.12.4.jar ...
	[SUCCESSFUL ] org.scala-lang#scala-library;2.12.4!scala-library.jar (351ms)
downloading https://repo1.maven.org/maven2/org/scala-sbt/main_2.12/1.1.1/main_2.12-1.1.1.jar ...
	[SUCCESSFUL ] org.scala-sbt#main_2.12;1.1.1!main_2.12.jar (100ms)
downloading https://repo1.maven.org/maven2/org/scala-sbt/logic_2.12/1.1.1/logic_2.12-1.1.1.jar ...
	[SUCCESSFUL ] org.scala-sbt#logic_2.12;1.1.1!logic_2.12.jar (78ms)
downloading https://repo1.maven.org/maven2/org/scala-sbt/actions_2.12/1.1.1/actions_2.12-1.1.1.jar ...
	[SUCCESSFUL ] org.scala-sbt#actions_2.12;1.1.1!actions_2.12.jar (80ms)
downloading https://repo1.maven.org/maven2/org/scala-sbt/main-settings_2.12/1.1.1/main-settings_2.12-

Getting org.scala-sbt sbt 1.1.1  (this may take some time)...
Getting Scala 2.12.4 (for sbt)...
/tvm/vta/apps/tsim_example/hardware/chisel/build/verilator/VTestAccel__Slow.cpp: In constructor ‘VTestAccel::VTestAccel(const char*)’:
     VTestAccel__Syms* __restrict vlSymsp = __VlSymsp = new VTestAccel__Syms(this, name());
                                                                                         ^
/tvm/vta/apps/tsim_example/hardware/chisel/build/verilator/VTestAccel__Slow.cpp:18:89: note: uses ‘void* operator new(long unsigned int)’, which does not have an alignment parameter
/tvm/vta/apps/tsim_example/hardware/chisel/build/verilator/VTestAccel__Slow.cpp:18:89: note: use ‘-faligned-new’ to enable C++17 over-aligned new support
/tvm/vta/hardware/dpi/tsim_device.cc: In function ‘int VTADPISim(uint64_t)’:
/tvm/vta/hardware/dpi/tsim_device.cc:89:27: note: in expansion of macro ‘VL_TSIM_NAME’
   VL_TSIM_NAME* top = new VL_TSIM_NAME;
                           ^~~~~~~~~~~~
<comm

## Software driver

### Source files

In [6]:
%%bash
cat /tvm/vta/apps/tsim_example/src/driver.cc

/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied.  See the License for the
 * specific language governing permissions and limitations
 * under the License.
 */

#include <tvm/runtime/module.h>
#include <tvm/runtime/registry.h>
#include <vta/dpi/module.h>

namespace vta {
namespace driver {

uint32_t get_half_addr(void *p, bool upper) {
  if (upper) {

### How to build

In [7]:
%%bash
cd /tvm/vta/apps/tsim_example
make driver

mkdir -p /tvm/vta/apps/tsim_example/build
cd /tvm/vta/apps/tsim_example/build && cmake .. && make
-- The C compiler identification is GNU 7.4.0
-- The CXX compiler identification is GNU 7.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring done
-- Generating done
-- Build files have been written to: /tvm/vta/apps/tsim_example/build
make[1]: Entering directory '/tvm/vta/apps/tsim_example/build'
make[2]: Entering directory '/tvm/vta/apps/tsim_example/build'
make[3]: Entering directory '/tvm/vta/apps/tsim_example/build'
Scanning d

## Create a test

In [0]:
import tvm
import numpy as np
import ctypes

In [0]:
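def tsim(hw_backend):
    def load_dll(dll):
        # Load the shared library with global symbol visibility.
        # A failed load is ignored and yields an empty list.
        try:
            return [ctypes.CDLL(dll, ctypes.RTLD_GLOBAL)]
        except OSError:
            return []

    def run(a, b, c):
        # Pick the simulated hardware library for the requested backend.
        if hw_backend == "chisel":
            hw_lib = '/tvm/vta/apps/tsim_example/hardware/chisel/build/libhw.so'
        else:
            hw_lib = '/tvm/vta/apps/tsim_example/hardware/verilog/build/libhw.so'
        sw_lib = '/tvm/vta/apps/tsim_example/build/libsw.so'
        # libsw.so is expected to register the "tvm.vta.driver" function.
        load_dll(sw_lib)
        f = tvm.get_global_func("tvm.vta.driver")
        m = tvm.module.load(hw_lib, "vta-tsim")
        # The driver runs the simulation and returns the cycle count.
        cycles = f(m, a, b, c)
        print("cycles:{}".format(cycles))
    return run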

In [0]:
def test_accel(n, c, hw_backend):
    ctx = tvm.cpu(0)
    rmax = 64
    # Random input values in [0, rmax) and a zero-filled output buffer.
    a = tvm.nd.array(np.random.randint(rmax, size=n).astype("uint64"), ctx)
    b = tvm.nd.array(np.zeros(n).astype("uint64"), ctx)
    f = tsim(hw_backend)
    f(a, b, c)
    # Print input and output side by side; each b should equal a + c.
    for i, (x, y) in enumerate(zip(a.asnumpy(), b.asnumpy())):
        print("i:{0:<4} c:{1:<4} a:{2:<4} b:{3:<4}".format(i, c, x, y))

## Run Accelerator in Verilog

In [27]:
test_accel(5, 2, "verilog")

cycles:24
i:0    c:2    a:51   b:53  
i:1    c:2    a:57   b:59  
i:2    c:2    a:53   b:55  
i:3    c:2    a:60   b:62  
i:4    c:2    a:37   b:39  
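
The test only prints its results, so correctness has to be read off the log by eye. As one possible way to automate that, the sketch below parses log lines in the format printed above and reports any row where `b` differs from `a + c`. The `ROW` pattern and the `mismatches` helper are illustrative names, not part of the TSIM example; the log text is copied from the Verilog run.

```python
import re

# Log lines copied from the Verilog run in this notebook.
LOG = """\
i:0    c:2    a:51   b:53
i:1    c:2    a:57   b:59
i:2    c:2    a:53   b:55
i:3    c:2    a:60   b:62
i:4    c:2    a:37   b:39
"""

# One row of the form "i:<index> c:<const> a:<input> b:<output>".
ROW = re.compile(r"i:(\d+)\s+c:(\d+)\s+a:(\d+)\s+b:(\d+)")


def mismatches(log):
    """Return the indices of rows where b != a + c."""
    bad = []
    for line in log.splitlines():
        m = ROW.match(line)
        if m is None:
            continue  # skip lines such as "cycles:24"
        i, c, a, b = map(int, m.groups())
        if b != a + c:
            bad.append(i)
    return bad


print(mismatches(LOG))
```

An empty list means every logged row satisfies `b = a + c`.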


## Run Accelerator in Chisel

In [28]:
test_accel(20, 40, "chisel")

cycles:99
i:0    c:40   a:0    b:40  
i:1    c:40   a:38   b:78  
i:2    c:40   a:21   b:61  
i:3    c:40   a:0    b:40  
i:4    c:40   a:55   b:95  
i:5    c:40   a:63   b:103 
i:6    c:40   a:58   b:98  
i:7    c:40   a:59   b:99  
i:8    c:40   a:57   b:97  
i:9    c:40   a:39   b:79  
i:10   c:40   a:46   b:86  
i:11   c:40   a:53   b:93  
i:12   c:40   a:4    b:44  
i:13   c:40   a:33   b:73  
i:14   c:40   a:4    b:44  
i:15   c:40   a:48   b:88  
i:16   c:40   a:20   b:60  
i:17   c:40   a:61   b:101 
i:18   c:40   a:30   b:70  
i:19   c:40   a:20   b:60  
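
The two cycle counts reported in this notebook (24 cycles for 5 elements, 99 cycles for 20 elements) allow a rough back-of-the-envelope estimate of cost per element. This is only a sketch: the two runs use different backends, and a straight line through two points proves nothing about how the hardware actually scales. The variable names are illustrative.

```python
# (elements, cycles) pairs copied from the Verilog and Chisel runs.
runs = [(5, 24), (20, 99)]

(n0, c0), (n1, c1) = runs

# Slope and intercept of the line through the two points.
per_element = (c1 - c0) / (n1 - n0)
offset = c0 - per_element * n0

print("cycles per element: {}".format(per_element))
print("fixed offset: {}".format(offset))
```

Running `test_accel` with several sizes on a single backend would be a more reliable way to check whether the cycle count really grows linearly.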
