In [1]:
%set_env PATH=/home/subh/anaconda3/bin:/home/subh/tools/llvm-10/build/bin:/home/subh/tools/llvm-10/build/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games
%cd ../..

env: PATH=/home/subh/anaconda3/bin:/home/subh/tools/llvm-10/build/bin:/home/subh/tools/llvm-10/build/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games
/home/subh/research/hetsim-rel


# HetSim Demo
We will now demonstrate the use of HetSim on an example scenario.

* We will first change an existing target model
* We will then write an application for it and run it through a) detailed simulation, and b) HetSim

## Install Dependencies
* Install LLVM (version 10.0 or newer; this demo uses an LLVM 10 build) by following the instructions at https://llvm.org/docs/GettingStarted.html
* Install any cross-compilers required for the target
  * This example uses an Arm gcc that you can install by running `sudo apt install g++-arm-linux-gnueabihf`

## Build gem5
We will build gem5 by running a convenience script inside `scripts/`.

In [2]:
%cd scripts
!VERBOSE=0 CC=/usr/bin/gcc CXX=/usr/bin/g++ bash build-gem5.sh
%cd ..

/home/subh/research/hetsim-rel/scripts
[0]: Starting gem5 build for TimingSimpleCPU
[1]: gem5 build succeeded
[2]: Compiling m5threads library
make: '../pthread.o' is up to date.
[3]: build-gem5.sh successfully exiting
/home/subh/research/hetsim-rel


## Construct a gem5 Model for the Target
Consider a programmable target composed of two types of processing elements (PEs): **worker** and **manager**.

<center>
    <img src="diagram.png" alt="example target" style="width: 500px;" align="center"/>
</center>

* All PEs share a D-Cache, a DSPM, and the main memory
* Each PE has a private instruction cache
* The manager distributes work to the workers via _FIFO queues_
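
To make the role of the queue depth concrete, here is a minimal single-threaded sketch of a bounded FIFO. The class `BoundedFifo` and its methods are hypothetical names used only for illustration; this is not HetSim's queue implementation, and a real blocking queue would make the producer or consumer wait instead of returning `false`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical single-threaded model of one work queue (not HetSim code).
class BoundedFifo {
public:
    explicit BoundedFifo(std::size_t depth) : depth_(depth) { assert(depth > 0); }

    // Returns false when the queue already holds `depth` entries.
    bool try_push(std::uintptr_t value) {
        if (items_.size() == depth_) return false;
        items_.push_back(value);
        return true;
    }

    // Returns false when the queue is empty; otherwise yields the oldest entry.
    bool try_pop(std::uintptr_t &out) {
        if (items_.empty()) return false;
        out = items_.front();
        items_.pop_front();
        return true;
    }

    std::size_t size() const { return items_.size(); }

private:
    std::size_t depth_;
    std::deque<std::uintptr_t> items_;
};
```

With a depth of 2, a third `try_push` fails until one entry has been popped, and entries come out in the order they went in.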

### Tweak the Existing Model
We will make two changes to the existing model.
* Change the number of workers from 4 &#8594; 8
* Change the depth of each work queue from 4 &#8594; 6

#### Step 1: Change Macros and Python Bindings

In [3]:
%cd example/model
# change the relevant define in params.h
!sed -i 's/#define NUM_WORKER.*/#define NUM_WORKER             8/g' params.h
!sed -i 's/#define WQ_DEPTH.*/#define WQ_DEPTH               6/g' params.h
# generate Python bindings
!make
%cd ../..

/home/subh/research/hetsim-rel/example/model
swig -python -module params params.h
gcc -c -fpic params_wrap.c -I/usr/include/python2.7 -o params_wrap.o
gcc -shared params_wrap.o -o _params.so
/home/subh/research/hetsim-rel


#### Step 2: Reflect Changes in User Spec and Libraries

* Assign PE IDs to the new worker PEs and queue IDs to the queues corresponding to the new workers

In [4]:
# in practice, edit spec/spec.json by hand; sed is used here only to script the demo
!sed -i 's/\[1, 2, 3, 4\]/[1, 2, 3, 4, 5, 6, 7, 8]/g' spec/spec.json

* Generate code to connect the queues in the `gem5` model and the emulation and TRE libraries

In [5]:
%cd scripts
!python2 populate_init_queues.py
!VERBOSE=0 bash build-gem5.sh
%cd ..

/home/subh/research/hetsim-rel/scripts
[0]: Starting gem5 build for TimingSimpleCPU
[1]: gem5 build succeeded
[2]: Compiling m5threads library
make: '../pthread.o' is up to date.
[3]: build-gem5.sh successfully exiting
/home/subh/research/hetsim-rel
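
The worker ID list written into `spec/spec.json` in this step is simply `1..NUM_WORKER`, since ID 0 is reserved for the manager. As a rough sketch, a small helper could produce that list for any worker count. `worker_id_list` is a hypothetical name and is not part of the HetSim scripts.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: formats worker IDs 1..num_workers as "[1, 2, ..., n]".
std::string worker_id_list(unsigned num_workers) {
    assert(num_workers > 0);
    std::string out = "[";
    for (unsigned id = 1; id <= num_workers; ++id) {
        if (id > 1) out += ", ";
        out += std::to_string(id);
    }
    out += "]";
    return out;
}
```

For eight workers this yields the same string that the `sed` command substitutes into the spec.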


#### Step 3: Regenerate Compiler Plugin
Finally, we regenerate the compiler plugin and build the tracing library.

In [6]:
%cd scripts
!python generate_model.py ../spec/spec.json > /dev/null
%cd ../tracer
%mkdir -p build
%cd build
!cmake .. > /dev/null && make
%cd ../..

/home/subh/research/hetsim-rel/scripts
/home/subh/research/hetsim-rel/tracer
/home/subh/research/hetsim-rel/tracer/build
/usr/bin/ar: creating t.a
Scanning dependencies of target LLVMHetsim
[ 20%] Building CXX object compiler-pass/CMakeFiles/LLVMHetsim.dir/hetsim-analysis.cpp.o
[ 40%] Building CXX object compiler-pass/CMakeFiles/LLVMHetsim.dir/hetsim-codegen.cpp.o
[ 60%] Linking CXX shared module LLVMHetsim.so
[ 60%] Built target LLVMHetsim
Scanning dependencies of target hetsim_default_rt
[ 80%] Building CXX object runtime/default/CMakeFiles/hetsim_default_rt.dir/hetsim_default_rt.cpp.o
[100%] Linking CXX shared library libhetsim_default_rt.so
[100%] Built target hetsim_default_rt
/home/subh/research/hetsim-rel


## Write Application for the Target
We will write a program to do **vector addition** on the target hardware.

### Header Files
```Cpp
#include "params.h"  // import parameters of target hardware as macros
#include "util.h"    // import primitive definitions
#include <pthread.h>
#include <sys/mman.h>
```

### Boilerplate Initialization
```Cpp

void *work(void *arg) { // manager "spawns" worker threads with tid=1,2,3...
    unsigned tid = *(unsigned *)(arg);
    __register_core_id(tid);
    ...
}
int main() {
    __init_queues(WQ_DEPTH);
    __register_core_id(0); // manager is assigned core-id 0
    ...
    __teardown_queues();
    return 0;
}
```

                           
                           

#### Note
Each PE's core ID must match across the application, `spec.json`, and `target.py`.

In [7]:
# spec.json
!grep -A1 -B1 "id" spec/spec.json

        "mgr": {
            "id": [0],
            "__push(unsigned int, unsigned long)" : {
--
        "wrkr": {
            "id": [1, 2, 3, 4, 5, 6, 7, 8],
            "__pop(unsigned int)": {


In [8]:
# target.py
!grep -A1 -B1 "id=" example/model/target.py

    system.mgr = TRE(
        id=0,   # This must correspond to the ID assigned in the user spec file.
        queue_depth=WQ_DEPTH,
--
        wrkr.append(TRE(
            id=i+1, # This must correspond to the ID assigned in the user spec file.
            max_outstanding_addrs=MAX_OUTSTANDING_REQS


### Manager PE code: `main()`
Memory is allocated at the beginning of `main()`.

```Cpp
    // main memory allocation
    // in this example, we are working with 3 float arrays each of size N
    size_t RAM_SIZE_BYTES = 3 * N * sizeof(float);
    char *ram = (char *)mmap((void *)(RAM_BASE_ADDR), RAM_SIZE_BYTES,
                             PROT_READ | PROT_WRITE | PROT_EXEC, 
                             MAP_ANON | MAP_PRIVATE, 0, 0);
```

```Cpp
    // scratchpad memory allocation
#ifdef EMULATION
    // for emulation
    char *dspm = (char *)mmap((void *)(SPM_BASE_ADDR), SPM_SIZE_BYTES,
                              PROT_READ | PROT_WRITE | PROT_EXEC, 
                              MAP_ANON | MAP_PRIVATE, 0, 0);
#else  // !EMULATION
    // the model uses physically-addressed scratchpad that does not need explicit allocation
    char *dspm = (char *)SPM_BASE_ADDR;
#endif // EMULATION
```

We "allocate" the input and output vectors in the pre-allocated main memory and initialize them.
```Cpp
    // allocate the vectors and populate them
    float *a = (float *)(ram);
    float *b = (float *)(ram + N * sizeof(float));
    float *c = (float *)(ram + 2 * N * sizeof(float));
    for (int i = 0; i < N; ++i) {
        a[i] = float(i + 1);
        b[i] = float(i + 1);
        c[i] = 0.0;
    }
```
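
The fragments above do not check the result of `mmap`. The following host-only sketch shows one way to add that check and to carve three arrays out of a single region. It is an illustration under assumptions: `map_region`, `Vectors`, and `carve_vectors` are hypothetical names, and unlike the tutorial code it lets the kernel choose the address (no `RAM_BASE_ADDR` hint) and passes `-1` as the file descriptor.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>

// Hypothetical helper: maps an anonymous, zero-filled region of `bytes` bytes.
// Returns nullptr on failure so callers can test the result directly.
char *map_region(std::size_t bytes) {
    void *p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    return p == MAP_FAILED ? nullptr : static_cast<char *>(p);
}

// Three equally sized float arrays laid out back to back in one region.
struct Vectors {
    float *a;
    float *b;
    float *c;
};

// Splits a region of at least 3 * n * sizeof(float) bytes into a, b, and c.
Vectors carve_vectors(char *base, std::size_t n) {
    assert(base != nullptr);
    Vectors v;
    v.a = reinterpret_cast<float *>(base);
    v.b = reinterpret_cast<float *>(base + n * sizeof(float));
    v.c = reinterpret_cast<float *>(base + 2 * n * sizeof(float));
    return v;
}
```

A caller would request `3 * N * sizeof(float)` bytes, bail out if `map_region` returns `nullptr`, and release the region with `munmap` when done.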

For illustration, we allocate a barrier in the shared DSPM.
```Cpp
    // allocate barrier object for synchronization
    pthread_barrier_t *bar = (pthread_barrier_t *)(dspm);
    // initialize barrier with participants = 1 manager + NUM_WORKER workers
    __barrier_init(bar, NUM_WORKER + 1); 
```

Next, we allocate and spawn the worker threads.
```Cpp
    // allocate thread objects for each "worker" PE
    pthread_t *workers = new pthread_t[NUM_WORKER];
```

```Cpp
    // create vector of core IDs to send to each thread
    unsigned *tids = new unsigned[NUM_WORKER];
    for (int i = 0; i < NUM_WORKER; ++i) {
        tids[i] = i + 1;
        // spawn worker thread
        pthread_create(workers + i, NULL, work, &tids[i]);
    }
```

The most important part of the manager code distributes the work and pushes it to the workers.
```Cpp
    // partition the work and push work "packets"
    for (int i = 0; i < NUM_WORKER; ++i) {
        // each worker is assigned floor(N / NUM_WORKER) elements
        int n = N / NUM_WORKER;
        int start_idx = i * n;
        int end_idx = (i + 1) * n - 1;

        // handle trailing elements by assigning to final worker
        if (i == NUM_WORKER - 1) {
            end_idx = N - 1;
        }
```

```Cpp
        // push through work queues
        __push(i + 1, (uintptr_t)(a));
        __push(i + 1, (uintptr_t)(b));
        __push(i + 1, (uintptr_t)(c));
        __push(i + 1, (unsigned)(start_idx));
        __push(i + 1, (unsigned)(end_idx));
        __push(i + 1, (uintptr_t)(bar));
    }
```
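
The partitioning rule in the loop above can be isolated into a small function, which makes the handling of trailing elements easier to inspect. This is a sketch only: `Range` and `partition` are hypothetical names, not part of the application source.

```cpp
#include <cassert>

// Inclusive index range assigned to one worker.
struct Range {
    int start_idx;
    int end_idx;
};

// Hypothetical helper mirroring the loop body above: worker `i` (0-based) of
// `num_workers` gets floor(n / num_workers) elements, and the last worker
// additionally takes the trailing n % num_workers elements.
Range partition(int n, int num_workers, int i) {
    assert(num_workers > 0 && i >= 0 && i < num_workers);
    int chunk = n / num_workers;
    Range r;
    r.start_idx = i * chunk;
    r.end_idx = (i == num_workers - 1) ? n - 1 : (i + 1) * chunk - 1;
    return r;
}
```

For example, with 10 elements and 4 workers, the first three workers get 2 elements each and the last worker gets 4, so the final worker can carry noticeably more work when `n` is not a multiple of the worker count.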

```Cpp
// ----- ROI begin -----
    __reset_stats(); // begin recording time here
    for (int i = 0; i < NUM_WORKER; ++i) {
        __push(i + 1, 0); // start signal, value is ignored
    }
    __barrier_wait(bar); // synchronize with worker threads

    __dump_reset_stats(); // end recording time here
// ----- ROI end -----
```

```Cpp
    // join with all threads
    for (int tid = 0; tid < NUM_WORKER; ++tid) {
        pthread_join(workers[tid], NULL);
    }
```

```Cpp
    // clean up
#ifdef EMULATION
    munmap(dspm, SPM_SIZE_BYTES);
#endif // EMULATION
    munmap(ram, RAM_SIZE_BYTES);
    delete[] workers;
    delete[] tids;
    __teardown_queues();
} // end of main()
```

### Worker PE Code: `work()`

```Cpp
void *work(void *arg) {
    unsigned tid = *(unsigned *)(arg);
    __register_core_id(tid);

    // retrieve variables from work queue
    volatile float *a = (volatile float *)__pop(0);
    volatile float *b = (volatile float *)__pop(0);
    volatile float *c = (volatile float *)__pop(0);
    int start_idx = (int)__pop(0);
    int end_idx = (int)__pop(0);
    pthread_barrier_t *bar = (pthread_barrier_t *)__pop(0);
```

```Cpp
    // receive start signal
    __pop(0);

    // perform actual computation
    for (int i = start_idx; i <= end_idx; ++i) {
        c[i] += a[i] + b[i];
    }

    // synchronize with manager
    __barrier_wait(bar);

    return NULL;
} // end of work()
```
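
The manager and worker code only agree because values are popped in exactly the order they were pushed. The single-threaded sketch below illustrates that contract with a plain `std::queue` standing in for a work queue. `WorkQueue`, `WorkPacket`, `send_packet`, and `receive_packet` are hypothetical names, and the barrier pointer and start signal are left out for brevity.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>

// Hypothetical stand-in for one manager-to-worker queue (single-threaded).
using WorkQueue = std::queue<std::uintptr_t>;

struct WorkPacket {
    float *a;
    float *b;
    float *c;
    int start_idx;
    int end_idx;
};

// Manager side: push the fields in a fixed order.
void send_packet(WorkQueue &q, const WorkPacket &p) {
    q.push(reinterpret_cast<std::uintptr_t>(p.a));
    q.push(reinterpret_cast<std::uintptr_t>(p.b));
    q.push(reinterpret_cast<std::uintptr_t>(p.c));
    q.push(static_cast<std::uintptr_t>(p.start_idx));
    q.push(static_cast<std::uintptr_t>(p.end_idx));
}

// Removes and returns the oldest entry of a non-empty queue.
static std::uintptr_t pop_front(WorkQueue &q) {
    assert(!q.empty());
    std::uintptr_t v = q.front();
    q.pop();
    return v;
}

// Worker side: pop the fields in the same order they were pushed.
WorkPacket receive_packet(WorkQueue &q) {
    WorkPacket p;
    p.a = reinterpret_cast<float *>(pop_front(q));
    p.b = reinterpret_cast<float *>(pop_front(q));
    p.c = reinterpret_cast<float *>(pop_front(q));
    p.start_idx = static_cast<int>(pop_front(q));
    p.end_idx = static_cast<int>(pop_front(q));
    return p;
}
```

Swapping any two pops on the worker side would silently reinterpret an index as a pointer or vice versa, which is why the push and pop sequences should be kept side by side when either one changes.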

## Verify Functionality of Emulated Code

In [9]:
# build emulator library
%cd emu
!mkdir -p build
%cd build
!cmake .. > /dev/null && make
%cd ../..

/home/subh/research/hetsim-rel/emu
/home/subh/research/hetsim-rel/emu/build
Scanning dependencies of target hetsim_prim
[ 50%] Building CXX object CMakeFiles/hetsim_prim.dir/src/util.cpp.o
[100%] Linking CXX shared library libhetsim_prim.so
[100%] Built target hetsim_prim
/home/subh/research/hetsim-rel


In [10]:
# build application with emulation library
%cd example/app
%rm -rf build
%mkdir -p build
%cd build
!CC=/usr/bin/gcc CXX=/usr/bin/g++ MODE=EMU cmake .. > /dev/null && make
%cd ../../..

/home/subh/research/hetsim-rel/example/app
/home/subh/research/hetsim-rel/example/app/build
MODE set to EMU
Processing application: serial_factorial
Processing application: vector_add
Processing application: workq_mutex
Scanning dependencies of target workq_mutex
[ 16%] Building CXX object CMakeFiles/workq_mutex.dir/src/workq_mutex.cpp.o
[ 33%] Linking CXX executable workq_mutex
[ 33%] Built target workq_mutex
Scanning dependencies of target serial_factorial
[ 50%] Building CXX object CMakeFiles/serial_factorial.dir/src/serial_factorial.cpp.o
[ 66%] Linking CXX executable serial_factorial
[ 66%] Built target serial_factorial
Scanning dependencies of target vector_add
[ 83%] Building CXX object CMakeFiles/vector_add.dir/src/vector_add.cpp.o
[100%] Linking CXX executable vector_add
[100%] Built target vector_add
/home/subh/research/hetsim-rel


In [11]:
# run emulated application for functional verification
%cd example/app/build
!./vector_add
%cd ../../..

/home/subh/research/hetsim-rel/example/app/build
== Vector Add Test with N = 100000, NUM_WORKER = 8
== Test Passed ==
/home/subh/research/hetsim-rel
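
The notebook does not show how the application decides that the test passed. One plausible check, given the initialization `a[i] = b[i] = i + 1` and `c[i] = 0`, is that every output element equals `2 * (i + 1)`. The sketch below is an assumption about what such a check could look like; `verify_vector_add` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical checker: with a[i] = b[i] = i + 1 and c[i] initially 0,
// the worker loop should leave c[i] == 2 * (i + 1) for every element.
bool verify_vector_add(const std::vector<float> &c) {
    for (std::size_t i = 0; i < c.size(); ++i) {
        if (c[i] != 2.0f * static_cast<float>(i + 1)) return false;
    }
    return true;
}
```

Exact floating-point comparison is acceptable here because all values involved are small integers that `float` represents exactly for `N = 100000`.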


## Simulation on Detailed Model

In [12]:
%set_env CMAKE_C_COMPILER=/usr/bin/arm-linux-gnueabihf-gcc
%set_env CMAKE_CXX_COMPILER=/usr/bin/arm-linux-gnueabihf-g++

env: CMAKE_C_COMPILER=/usr/bin/arm-linux-gnueabihf-gcc
env: CMAKE_CXX_COMPILER=/usr/bin/arm-linux-gnueabihf-g++


In [25]:
# build application for detailed simulation (cross-compiled with the Arm toolchain set above)
%cd example/app
%rm -rf build
%mkdir -p build
%cd build
!MODE=SIM cmake .. > /dev/null && make
%cd ../../../

/home/subh/research/hetsim-rel/example/app
/home/subh/research/hetsim-rel/example/app/build
MODE set to SIM
Processing application: serial_factorial
Processing application: vector_add
Processing application: workq_mutex
Scanning dependencies of target m5threads
[ 12%] Building C object CMakeFiles/m5threads.dir/home/subh/research/hetsim-rel/m5threads/pthread.c.o
/home/subh/research/hetsim-rel/m5threads/pthread.c: In function ‘pthread_create’:
   clone(__pthread_trampoline, tcb->stack_start_addr, CLONE_VM|CLONE_FS|CLONE_FI
   ^
[ 25%] Linking C static library libm5threads.a
[ 25%] Built target m5threads
Scanning dependencies of target serial_factorial
[ 37%] Building CXX object CMakeFiles/serial_factorial.dir/src/serial_factorial.cpp.o
[ 50%] Linking CXX executable serial_factorial
[ 50%] Built target serial_factorial
Scanning dependencies of target workq_mutex


In [27]:
# run gem5 simulation
%cd scripts
!MODE=SIM APP=vector_add bash run-gem5.sh > /dev/null
!echo "..." && tail -n20 ../gem5/m5out/run.log
%cd ../
### Gather Runtime
%mkdir -p res
%cp gem5/m5out/stats.txt res/stats.det.txt
!grep sim_ticks res/stats.det.txt | head -n1 | tr -s ' ' | cut -d' ' -f2 > ticks.det.txt

/home/subh/research/hetsim-rel/scripts
...
== Vector Add Test with N = 100000, NUM_WORKER = 8
warn: PowerState: Already in the requested power state, request ignored
warn: User mode does not have SPSR
warn: User mode does not have SPSR
warn: User mode does not have SPSR
warn: User mode does not have SPSR
warn: User mode does not have SPSR
warn: User mode does not have SPSR
warn: User mode does not have SPSR
warn: User mode does not have SPSR
warn: User mode does not have SPSR
warn: User mode does not have SPSR
warn: User mode does not have SPSR
warn: User mode does not have SPSR
warn: User mode does not have SPSR
warn: User mode does not have SPSR
warn: User mode does not have SPSR
warn: User mode does not have SPSR
== Test Passed ==
Exiting @ tick 9534907382 because exiting with last active thread context
/home/subh/research/hetsim-rel


In [16]:
%set_env CMAKE_C_COMPILER=/usr/bin/gcc
%set_env CMAKE_CXX_COMPILER=/usr/bin/g++

env: CMAKE_C_COMPILER=/usr/bin/gcc
env: CMAKE_CXX_COMPILER=/usr/bin/g++


## Run Trace Generation
We will first make the changes required to enable tracing in the C++ program that we wrote, and then run the trace generation step.

```Cpp
#if defined(AUTO_TRACING) || defined(MANUAL_TRACING)
#include "hetsim_default_rt.h"
#endif

void *work(void *arg) {
    unsigned tid = *(unsigned *)(arg);
    __register_core_id(tid);
#if defined(AUTO_TRACING) || defined(MANUAL_TRACING)
    __open_trace_log(tid);
#endif // AUTO_TRACING || MANUAL_TRACING
    // ROI begin
    ...
    // ROI end
#if defined(AUTO_TRACING) || defined(MANUAL_TRACING)
    __close_trace_log(tid);
#endif // AUTO_TRACING || MANUAL_TRACING
}

```

```Cpp
int main() {
    ...
#if defined(AUTO_TRACING) || defined(MANUAL_TRACING)
    __open_trace_log(0); // use core-id as argument
#endif
    ...
    __barrier_init(bar, NUM_WORKER + 1); // set number of participants to 1 manager + NUM_WORKER worker PEs
    ... 
    __dump_reset_stats(); // end recording time here
    // ----- ROI end -----

#if defined(AUTO_TRACING) || defined(MANUAL_TRACING)
    __close_trace_log(0);
#endif // AUTO_TRACING || MANUAL_TRACING
    ...
} // end of main()
```

In [17]:
# run trace generation
%cd example/app/build
!mkdir -p traces
%rm -f CMakeCache.txt
!MODE=EMU_AUTO_TRACE cmake .. > /dev/null && make
!./vector_add

/home/subh/research/hetsim-rel/example/app/build
MODE set to EMU_AUTO_TRACE
Processing application: serial_factorial
Processing application: vector_add
Processing application: workq_mutex
Scanning dependencies of target workq_mutex
[ 16%] Building CXX object CMakeFiles/workq_mutex.dir/src/workq_mutex.cpp.o
#pragma message("Tracing enabled for this run")
        ^
[ 33%] Linking CXX executable workq_mutex
[ 33%] Built target workq_mutex
Scanning dependencies of target serial_factorial
[ 50%] Building CXX object CMakeFiles/serial_factorial.dir/src/serial_factorial.cpp.o
#pragma message("Tracing enabled for this run")
        ^
[ 66%] Linking CXX executable serial_factorial
[ 66%] Built target serial_factorial
Scanning dependencies of target vector_add
[ 83%] Building CXX object CMakeFiles/vector_add.dir/src/vector_add.cpp.o
[100%] Linking CXX executable vector_add
[1

In [18]:
### Trace Format
!head -n10 traces/pe_[01].trace  
%cd ../../../

==> traces/pe_0.trace <==
BARINIT 0xe0101000  9 10
ST @11 0x1dfdf60 (  )
ST @12 0x1dfdf64 (  )
ST @13 0x1dfdf68 (  )
ST @14 0x1dfdf6c (  )
ST @15 0x1dfdf70 (  )
ST @16 0x1dfdf74 (  )
ST @17 0x1dfdf78 (  )
ST @18 0x1dfdf7c (  )
PUSH 1 1

==> traces/pe_1.trace <==
POP 0 1
POP 0 1
POP 0 1
POP 0 1
POP 0 1
POP 0 1
POP 0 1
STALL 3 ( )
LD @1 0x40000000 (  )
LD @2 0x40061a80 ( 0x40000000  )
/home/subh/research/hetsim-rel
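
For a quick look at what a trace contains, the lines can be split into an opcode and its remaining fields and then counted per opcode. The sketch below only tokenizes on whitespace and does not interpret the fields, since their meaning is not specified here. `TraceEntry`, `split_trace_line`, and `opcode_histogram` are hypothetical names, not part of HetSim.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// One trace line split into an opcode and its whitespace-separated fields.
struct TraceEntry {
    std::string opcode;
    std::vector<std::string> fields;
};

// Tokenizes a line; the first token is treated as the opcode.
TraceEntry split_trace_line(const std::string &line) {
    std::istringstream in(line);
    TraceEntry e;
    in >> e.opcode;
    std::string tok;
    while (in >> tok) e.fields.push_back(tok);
    return e;
}

// Counts how often each opcode appears; blank lines are skipped.
std::map<std::string, int> opcode_histogram(const std::vector<std::string> &lines) {
    std::map<std::string, int> counts;
    for (const std::string &line : lines) {
        TraceEntry e = split_trace_line(line);
        if (!e.opcode.empty()) ++counts[e.opcode];
    }
    return counts;
}
```

Applied to the first ten lines of `pe_1.trace` shown above, such a histogram would report seven `POP` entries, one `STALL`, and two `LD` entries.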


## Run Trace Replay
We will now run the generated traces through the TRE-enabled `gem5` model.

In [24]:
%cd scripts
!MODE=EMU_TRACE APP=vector_add bash run-gem5.sh > /dev/null
!echo "..." && tail -n20 ../gem5/m5out/run.log
%cd ..

/home/subh/research/hetsim-rel/scripts
...
TRE[6]: halted @694256563 after completing 100005 trace entries
Number of TREs IDLE now: 1/9
TRE[8]: halted @694258565 after completing 100005 trace entries
Number of TREs IDLE now: 2/9
TRE[1]: halted @694260567 after completing 100005 trace entries
Number of TREs IDLE now: 3/9
TRE[3]: halted @694262569 after completing 100005 trace entries
Number of TREs IDLE now: 4/9
TRE[2]: halted @694264571 after completing 100005 trace entries
Number of TREs IDLE now: 5/9
TRE[7]: halted @694266573 after completing 100005 trace entries
Number of TREs IDLE now: 6/9
TRE[5]: halted @694268575 after completing 100005 trace entries
Number of TREs IDLE now: 7/9
TRE[4]: halted @694270577 after completing 100005 trace entries
Number of TREs IDLE now: 8/9
TRE[0]: triggered DMPRST
TRE[0]: halted @694273580 after completing 0 trace entries
Number of TREs IDLE now: 9/9
Exiting @ tick 694273580 because all TREs are done
/home/subh/research/hetsim-rel


#### Comparison between Detailed and HetSim Runs
As the final step, we will compare the runtime between the detailed `gem5` run and the TRE-enabled `gem5` run.

In [20]:
%cp gem5/m5out/stats.txt res/stats.tre.txt
!grep sim_ticks res/stats.tre.txt | head -n1 | tr -s ' ' | cut -d' ' -f2 > ticks.tre.txt
!cat ticks.det.txt
!cat ticks.tre.txt

752808056
694044351
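
The two tick counts can be turned into a single accuracy figure by computing the signed relative difference of the trace-replay run with respect to the detailed run. The sketch below assumes both `sim_ticks` values cover the same region of interest; `relative_error` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdint>

// Signed relative difference of the trace-replay tick count with respect to
// the detailed run: negative means the replay finished in fewer ticks.
double relative_error(std::uint64_t detailed_ticks, std::uint64_t replay_ticks) {
    assert(detailed_ticks > 0);
    return (static_cast<double>(replay_ticks) - static_cast<double>(detailed_ticks)) /
           static_cast<double>(detailed_ticks);
}
```

For the values printed above (752808056 detailed, 694044351 replayed), the replay estimate is about 7.8% lower than the detailed simulation.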


> This Jupyter Notebook is available in the public repository: https://github.com/umich-cadre/HetSim-gem5/demos/iiswc-20/tutorial.ipynb