<a href="https://colab.research.google.com/github/SiliconJackets/sscs-ose-code-a-chip.github.io/blob/main/ISSCC25/submitted_notebooks/SJHDComputing/HyperDimensionalComputing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hyperdimensional Computing With OpenLane

```
Copyright 2024 SiliconJackets @ Georgia Institute of Technology
SPDX-License-Identifier: GPL-3.0-or-later
```

Disentanglement of hyperdimensional vector image representations inspired by [1](https://www.nature.com/articles/s41565-023-01357-8#Sec1) using the [OpenLane](https://github.com/The-OpenROAD-Project/OpenLane/) RTL to GDS flow targeting the [open source SKY130 PDK](https://github.com/google/skywater-pdk/).

|Name|Affiliation| Email |IEEE Member|SSCS Member|
|:--:|:----------:|:----------:|:----------:|:----------:|
|Jack Cochran|Georgia Institute of Technology|jcochran66@gatech.edu|No|No|
|Sowmya Janapati|Georgia Institute of Technology|jsowmya@gatech.edu|No|No|
|Minseung Jung|Georgia Institute of Technology|mjung76@gatech.edu|No|No|
|Jackie MacHale|Georgia Institute of Technology|jackiemachale@gatech.edu|No|No|
|Nealson Li|Georgia Institute of Technology|nealson@gatech.edu|Yes|Yes|
|Zachary Ellis|Georgia Institute of Technology|zellis7@gatech.edu|Yes|Yes|

This notebook goes through the process of design specification, simulation, and implementation of a systolic array with open-source tools and PDKs. The parallel computation and data reuse ability of a systolic array is crucial for the acceleration of neural networks, and this notebook, with its reusable design, aims to contribute to the open-source hardware community and enable more efficient ML applications. This project explains the principles behind how a systolic array performs 2D convolution, demonstrates the performance of our implementation with image results, and shows the final GDS generated with the open-source flow. Additionally, to further demonstrate the feasibility of the open-source flow and our design, we are also submitting this systolic array design to the open-source silicon initiative, [Tiny Tapeout](https://tinytapeout.com/). This submission was completed by members of SiliconJackets, a student-run organization at Georgia Tech that introduces students to semiconductor design, verification, and implementation through a large collaborative project. We hope to use this notebook as an example for future members of the club.

## Introduction
---

In this notebook, we first explain what a systolic array is and how it is applied, referencing the row stationary dataflow introduced in [EYERISS](https://courses.cs.washington.edu/courses/cse550/21au/papers/CSE550.Eyeriss.pdf), on which our design is loosely based. We then explain the hardware specification and the design of the high-level architecture and processing unit. Next, we demonstrate the performance by simulating the hardware design performing convolution for an edge detection task and verifying it against the software golden reference. Lastly, the systolic array is pushed through the [OpenLane](https://github.com/The-OpenROAD-Project/OpenLane/) RTL to GDS flow with the open-source [SKY130 PDK](https://github.com/google/skywater-pdk/).

## Systolic Array
---

### What is a Systolic Array

A systolic array is a 2D array of individual processing elements (PEs), each of which can compute independently. Systolic arrays allow massively parallel computation, which is especially useful for machine learning applications that require a large number of MACs (multiply-accumulates) that do not depend on one another. This array construction not only enables massively parallel computation but also facilitates data reuse, which reduces the memory bottleneck. By scheduling operations correctly for something like matrix multiplication or 2D+ convolution, PEs that are next to each other can use the same data in their operations. Because of the PE arrangement, this data can be passed between PEs directly, which means it only has to be fetched once from memory. In this notebook we present a 3x3 systolic array that uses row stationary dataflow and show how it passes data between PEs for maximum data reuse and minimum required bandwidth.

### Convolutions and Systolic Arrays
In signal processing and machine learning, convolution plays a fundamental role in applications such as image processing, video processing, and digital filtering. A two-dimensional convolution (Conv2D) slides a filter matrix over a larger input matrix to produce an output matrix, and it underlies many algorithms used in computer vision and machine learning.
The convolution operation naturally allows for significant data reuse, as any value from the input, filter, or output matrices is used many times in different multiply-accumulate operations (MACs). The key to exploiting this opportunity for efficient convolutions is to use highly parallel hardware that reuses data loaded from memory as much as possible before it is returned to memory.
In a systolic array, data is loaded from memory and flows through a grid of identical processing elements (PEs), being reused differently in each PE over different clock cycles. The flow of data through the system can be compared to the flow of blood being pumped through the circulatory system.
Systolic arrays are very useful for matrix multiplication (GEMM), and before the row-stationary (RS) dataflow was used, convolution operations were converted into large GEMM operations before they could flow through the array of PEs. However, with the modified RS dataflow which we implemented in our design, the systolic-like array directly computes the Conv2D of the input and filter matrix efficiently.
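
As a point of reference, the sliding-window operation described above can be sketched in plain Python. This is only a minimal software illustration, not the notebook's golden reference (which uses PyTorch). The function name `conv2d_valid`, the stride of 1, and the absence of padding are assumptions made for the example. Like most ML frameworks, it computes a cross-correlation (the filter is not flipped).

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (lists of rows) with stride 1 and no padding."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(ih - kh + 1):
        row = []
        for c in range(iw - kw + 1):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]  # one MAC
            row.append(acc)
        out.append(row)
    return out


# Illustrative 4x4 input and a vertical-edge style 3x3 filter (made-up values).
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
kernel = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]
result = conv2d_valid(image, kernel)
print(result)
```

In this small case the four outputs need 36 MACs in total but touch only 16 distinct input values, which is the reuse opportunity the hardware is built to exploit.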

#### Row Stationary Dataflow

In a row stationary dataflow, each processing element in the systolic array has a small amount of scratchpad memory devoted to keeping row data in place while it is operated on. In this mode, each processing element computes a single output of a 1D convolution, and those partial sums are then accumulated along the columns to form the final outputs. During the initial loading of the filter weights and row data, the scratchpads must be fully populated before any computation can occur, but as the convolution moves across the rows, only one new byte of data needs to be read per PE, which makes this form of 2D convolution very memory efficient.

<div>
<img src="https://github.com/sscs-ose/sscs-ose-code-a-chip.github.io/blob/main/VLSI24/accepted_notebooks/SJSystolicArray/img/systolic_array_flow.gif?raw=1" width="1000"/>
</div>
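
The row stationary idea can also be illustrated in software: each PE performs a 1D convolution of one filter row with one input row, and the partial sums of the PEs in a column are added to form one output row. The sketch below is a behavioral illustration only, assuming a 3x3 filter, stride 1, and no padding. `conv1d_row` and `row_stationary_conv2d` are hypothetical helper names, and the mapping of rows to PEs is simplified relative to the RTL.

```python
def conv1d_row(input_row, filter_row):
    """1D sliding dot product of one filter row over one input row (one PE's job)."""
    k = len(filter_row)
    return [
        sum(input_row[c + j] * filter_row[j] for j in range(k))
        for c in range(len(input_row) - k + 1)
    ]


def row_stationary_conv2d(image, kernel):
    """Build each output row by summing the 1D results of one column of PEs."""
    kh = len(kernel)
    out = []
    for r in range(len(image) - kh + 1):
        # PE i in the column pairs filter row i with input row r + i.
        partials = [conv1d_row(image[r + i], kernel[i]) for i in range(kh)]
        # Partial sums are accumulated along the column.
        out.append([sum(vals) for vals in zip(*partials)])
    return out


# Illustrative data: 4x4 input and a Sobel-style horizontal-gradient filter.
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
edge_kernel = [
    [1, 0, -1],
    [2, 0, -2],
    [1, 0, -1],
]
rs_result = row_stationary_conv2d(image, edge_kernel)
print(rs_result)
```

Splitting the work by rows is what lets each PE keep its filter row and input window in the scratchpad and read just one new value per step.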

<!-- ![Flow](https://github.com/SiliconJackets/sscs-ose-code-a-chip.github.io/blob/main/VLSI24/submitted_notebooks/SJSystolicArray/img/systolic_array_flow.gif?raw=true){: width=250} -->

#### Applications

The main application for row stationary systolic arrays is 2D convolution. A convolution operation applies a filter kernel to a 2D input (for example, an image) and transforms it to pull out specific details. A single convolution may pick out the edges of objects, as shown in this notebook, while a chain of convolutions, such as in a convolutional neural network, may pick out more complex shapes in order to recognize objects such as a dog. Using a systolic array for 2D convolution is quick and efficient, which is why this hardware is the basis for many machine learning accelerators.

### How is the hardware designed?

In order to show off the high memory efficiency of row stationary dataflow, the external memory connections for the top level of this design are very limited. With 2 read ports and 1 write port, this design is only able to read in 16 bits of data each cycle and write 8 bits. However, this data is reused across PEs allowing up to 9 MACs a cycle with different data combinations.
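
The figures quoted above can be turned into a rough reuse estimate. The short calculation below is only a back-of-the-envelope sketch. The 8-bit operand width is an assumption (consistent with the one byte per PE mentioned earlier), and the "no-reuse" baseline is a hypothetical design in which every MAC fetches both operands from memory.

```python
READ_BITS_PER_CYCLE = 16   # from the text: 2 read ports
WRITE_BITS_PER_CYCLE = 8   # from the text: 1 write port
PEAK_MACS_PER_CYCLE = 9    # from the text: 3x3 array, up to 9 MACs a cycle
OPERAND_BITS = 8           # assumption: 8-bit inputs and weights

bytes_read = READ_BITS_PER_CYCLE // 8
macs_per_byte_read = PEAK_MACS_PER_CYCLE / bytes_read

# Hypothetical baseline without reuse: each MAC fetches two fresh operands.
no_reuse_bits = PEAK_MACS_PER_CYCLE * 2 * OPERAND_BITS
bandwidth_reduction = no_reuse_bits / READ_BITS_PER_CYCLE

print(f"{macs_per_byte_read} MACs per byte read at peak")
print(f"{bandwidth_reduction}x less read bandwidth than the no-reuse baseline")
```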

<div>
<img src="https://github.com/sscs-ose/sscs-ose-code-a-chip.github.io/blob/main/VLSI24/accepted_notebooks/SJSystolicArray/img/Top.png?raw=1" width="1000"/>
</div>

<!-- ![Flow](https://github.com/SiliconJackets/sscs-ose-code-a-chip.github.io/blob/main/VLSI24/submitted_notebooks/SJSystolicArray/img/Top.png?raw=true){: width=250} -->

#### Top Level Design

The top-level controller is responsible for controlling the timing of data reads and operation starts for all the PEs. After taking in the size of the input from the memory interface on the first cycle, the top-level controller schedules the control signals so that each PE reads the data on the memory bus when it is its turn. When a PE has the data it needs, and it is its turn in the sequence to start its 1D convolution, the top-level controller asserts the start signal for that PE. Because the start times are staggered, the state machines inside the PEs run such that the data is automatically summed up the column of PEs and only one result is available for writing at a time.

<div>
<img src="https://github.com/sscs-ose/sscs-ose-code-a-chip.github.io/blob/main/VLSI24/accepted_notebooks/SJSystolicArray/img/Ctrl.png?raw=1" width="1000"/>
</div>

<!-- ![Flow](https://github.com/SiliconJackets/sscs-ose-code-a-chip.github.io/blob/main/VLSI24/submitted_notebooks/SJSystolicArray/img/Ctrl.png?raw=true){: width=250} -->

#### PE Design

To reduce complexity and area, the control structure inside each PE is kept very simple. When the PE sees a control signal from the top-level controller to read in a new input or filter value, it reads the new value into the scratchpad and shifts the existing values over, evicting the oldest value (the scratchpad has a depth of 3). Once the PE sees a start signal, it spends 3 cycles doing MACs with the scratchpad values and then adds the result to the input psum. With the PE start signals staggered across the array, the psum_o of one PE in a column becomes the psum_i of the PE above it, with the top PE presenting the final value at the output. The PEs always rely on the correct data being present at the correct time, which is made possible by the scheduling of the memory transactions and the top-level controller.

<div>
<img src="https://github.com/sscs-ose/sscs-ose-code-a-chip.github.io/blob/main/VLSI24/accepted_notebooks/SJSystolicArray/img/PE.png?raw=1" width="1000"/>
</div>
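
The PE behavior described above can be mimicked with a small Python model. This is a behavioral sketch, not a cycle-accurate translation of the RTL. The class name `PE`, its method names, and the input values are hypothetical. Only the scratchpad depth of 3, the three MACs per start, and the psum_i/psum_o chaining come from the text.

```python
from collections import deque


class PE:
    """Behavioral sketch of one processing element (not cycle-accurate)."""

    DEPTH = 3  # scratchpad depth stated in the text

    def __init__(self):
        # deque(maxlen=DEPTH) evicts the oldest entry when a new one is appended.
        self.inputs = deque([0] * self.DEPTH, maxlen=self.DEPTH)
        self.weights = deque([0] * self.DEPTH, maxlen=self.DEPTH)

    def load_input(self, value):
        self.inputs.append(value)

    def load_weight(self, value):
        self.weights.append(value)

    def run(self, psum_i=0):
        """Three MACs over the scratchpads, then add the incoming partial sum."""
        acc = 0
        for x, w in zip(self.inputs, self.weights):  # one MAC per cycle in hardware
            acc += x * w
        return acc + psum_i  # this value is psum_o


def column_output(pes):
    """Chain psum_o -> psum_i from the bottom PE up to the top PE (index 0)."""
    psum = 0
    for pe in reversed(pes):
        psum = pe.run(psum)
    return psum


# Illustrative data: one filter row and one input row per PE in the column.
filter_rows = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
input_rows = [[1, 2, 4, 8], [3, 1, 0, 2], [5, 5, 6, 1]]

column = [PE() for _ in range(3)]
for pe, frow, irow in zip(column, filter_rows, input_rows):
    for w in frow:
        pe.load_weight(w)
    for x in irow[:3]:
        pe.load_input(x)
first = column_output(column)

# Sliding the window by one position: each PE reads a single new input value.
for pe, irow in zip(column, input_rows):
    pe.load_input(irow[3])
second = column_output(column)

print(first, second)
```

Using a fixed-length shift structure for the scratchpad mirrors the eviction behavior in the text: after the initial fill, one new value per PE is enough to produce the next output.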

<!-- ![Flow](https://github.com/SiliconJackets/sscs-ose-code-a-chip.github.io/blob/main/VLSI24/submitted_notebooks/SJSystolicArray/img/PE.png?raw=true){width=250} -->

In [None]:
#@title Install Dependencies {display-mode: "form"}
#@markdown Click the ▷ button to set up the simulation environment.

#@markdown Main components we will install

#@markdown *   verilator : a free and open-source software tool which converts Verilog (a hardware description language) to a cycle-accurate behavioral model in C++ or SystemC.
#@markdown *   pytorch : Used to format input data for the systolic array from the image files and do edge detection in software for the golden reference
#@markdown *   opencv : Used for input image manipulation
#@markdown *   fxpmath : This module helps emulate the fixed-point math behavior of our systolic array

%load_ext autoreload
%autoreload 2
!apt-get install -y verilator
!pip install torch
!pip install torchvision
!pip install opencv-python
!pip install fxpmath
!pip install numpy

### RTL2GDS Flow

In [None]:
#@markdown We need to remove the previously installed version of Verilator and also install libparse in order for OpenLane to function properly. In order for everything to run the first time in the notebook we will also need to restart the runtime. Once you click the ▷ button for this cell, at the bottom it will prompt you **Once deleted, variables cannot be recovered. Proceed (y/[n])?** Please type y
!apt remove -y verilator
!pip install libparse
%reset

In [None]:
#@title Install Dependencies {display-mode: "form"}
#@markdown Click the ▷ button to set up the digital design environment based on [conda-eda](https://github.com/hdl/conda-eda).

#@markdown Main components we will install

#@markdown *   Open_pdks.sky130a : a PDK installer for open-source EDA tools.
#@markdown *   Openlane : an automated RTL to GDSII flow based on several components including OpenROAD, Yosys, Magic, Netgen, CVC, SPEF-Extractor, KLayout and a number of custom scripts for design exploration and optimization.
#@markdown *   GDSTK : a C++ library for creation and manipulation of GDSII and OASIS files.

!apt remove -y verilator
#openlane_version = 'custom_set' #@param {type:"string"}
#open_pdks_version = 'custom_set' #@param {type:"string"}

#if openlane_version == 'latest':
#  openlane_version = ''
#if open_pdks_version == 'latest':
#  open_pdks_version = ''

import os
import pathlib

!curl -Ls https://micro.mamba.pm/api/micromamba/linux-64/latest | tar -xj bin/micromamba
conda_prefix_path = pathlib.Path('conda-env')
CONDA_PREFIX = str(conda_prefix_path.resolve())
!bin/micromamba create --yes --prefix $CONDA_PREFIX
!echo 'python ==3.7*' >> {CONDA_PREFIX}/conda-meta/pinned
!CI=0 bin/micromamba install --yes --prefix $CONDA_PREFIX \
                     --channel litex-hub \
                     --channel main \
                     openlane={"2023.11.03_0_gf4f8dad8"} \
                     open_pdks.sky130a={"1.0.458_0_g8c68aca"} \
                     openroad={"2.0_10927_g0922eecb9"} \
                     verilator={"5.018_57_ga022b672a"}
!bin/micromamba install --quiet \
                        --yes \
                        --prefix $CONDA_PREFIX \
                        --channel conda-forge \
                        --channel main \
                        gdstk

!pip install libparse
PATH = os.environ['PATH']
%env CONDA_PREFIX={CONDA_PREFIX}
%env PATH={CONDA_PREFIX}/bin:{PATH}
#%reset

In [None]:
%%writefile config.json
{
    "DESIGN_NAME": "top",
    "VERILOG_FILES": "dir::SystolicArray/src/*.sv",
    "CLOCK_PERIOD": 40,
    "CLOCK_NET": "clk",
    "CLOCK_PORT": "clk",

    "FP_SIZING": "absolute",
    "DIE_AREA": "0 0 480 200",
    "PL_TARGET_DENSITY": 0.8
}
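
As a quick sanity check on the configuration above, the sketch below derives the target clock frequency and die area from the same values. It assumes the usual OpenLane units: `CLOCK_PERIOD` in nanoseconds and `DIE_AREA` as `x0 y0 x1 y1` corner coordinates in microns. The values are inlined rather than read from `config.json` so that the snippet stands on its own.

```python
import json

# Same values as the config.json cell, inlined so the sketch is self-contained.
config = json.loads("""
{
    "DESIGN_NAME": "top",
    "CLOCK_PERIOD": 40,
    "FP_SIZING": "absolute",
    "DIE_AREA": "0 0 480 200",
    "PL_TARGET_DENSITY": 0.8
}
""")

clock_mhz = 1000 / config["CLOCK_PERIOD"]                 # period assumed in ns
x0, y0, x1, y1 = map(float, config["DIE_AREA"].split())   # corners assumed in microns
die_area_um2 = (x1 - x0) * (y1 - y0)

print(f"Target clock: {clock_mhz:.0f} MHz")
print(f"Die area: {die_area_um2:.0f} um^2 ({die_area_um2 / 1e6:.3f} mm^2)")
```

Under these assumptions the design targets a 25 MHz clock on a 480 x 200 micron die.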

### Run Flow
If the flow fails due to a Verilator (linter) error or a libparse error (on step 34), restart the runtime and rerun the install dependencies cells. Simply re-running the install dependencies cells without restarting may also work.

In [None]:
%env PDK=sky130A
!flow.tcl -design .
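
Once the flow finishes, the generated layout can be located with a few lines of Python. The sketch below assumes the OpenLane run directory layout `runs/<run tag>/results/final/gds/<DESIGN_NAME>.gds`. If your OpenLane version places results elsewhere, adjust the pattern. `latest_gds` is a hypothetical helper name.

```python
import pathlib


def latest_gds(design_dir=".", design_name="top"):
    """Return the most recently written final GDS for the design, or None."""
    # Assumed layout: runs/<run tag>/results/final/gds/<design_name>.gds
    pattern = f"runs/*/results/final/gds/{design_name}.gds"
    candidates = sorted(
        pathlib.Path(design_dir).glob(pattern),
        key=lambda p: p.stat().st_mtime,
    )
    return candidates[-1] if candidates else None


gds = latest_gds()
print(gds if gds else "No GDS found yet; check the flow log for errors.")
```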