A custom neural network accelerator built from scratch in Verilog, featuring a weight-stationary systolic array that performs matrix multiplication for neural network inference. The design runs MNIST handwritten digit classification entirely in RTL simulation, achieving 97.5% accuracy with Q8.8 fixed-point arithmetic.
| Metric | Value |
|---|---|
| Float baseline accuracy | 97.6% |
| Q8.8 quantized accuracy | 97.5% (0.1% drop) |
| RTL simulation accuracy | 100% (100/100 test images) |
| Cycles per inference (4×4) | ~273K |
| Total parameters | 109,184 |
| Array architecture | 4×4 weight-stationary systolic |
| Arithmetic | Q8.8 signed fixed-point (16-bit) |
Weight Data (top)
┌───┬───┬───┬───┐
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
┌─────┬───┬───┬───┬───┐
Activation ──►│ │PE │PE │PE │PE │──► (unused)
Data ──►│ │00 │01 │02 │03 │
(left) ──►│ ├───┼───┼───┼───┤
──►│ │PE │PE │PE │PE │──►
│Array│10 │11 │12 │13 │
│ ├───┼───┼───┼───┤
│ │PE │PE │PE │PE │──►
│ │20 │21 │22 │23 │
│ ├───┼───┼───┼───┤
│ │PE │PE │PE │PE │──►
│ │30 │31 │32 │33 │
└─────┴───┴───┴───┴───┘
│ │ │ │
▼ ▼ ▼ ▼
Accumulator Outputs (bottom)
│
▼
┌────────┐
│ ReLU │
└────┬───┘
│
▼
┌──────────────┐
│Output Buffer │
└──────────────┘
In weight-stationary mode, weights are pre-loaded into PEs and remain fixed while activations stream through. This is the same dataflow used in the Google TPU v1.
Why weight-stationary?
- Higher PE utilization than output-stationary (~100% vs ~50%)
- Simpler control logic than row-stationary (Eyeriss)
- Natural fit for fully-connected layers where weights are reused across batches
- Proven at scale by Google's 256×256 TPU
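The weight-stationary dataflow can be sketched behaviorally in a few lines of Python (class and signal names below are illustrative, not the RTL module's ports): the weight is loaded once, then each cycle the PE multiplies the incoming activation by its resident weight, adds the partial sum from above, and forwards the activation to its right-hand neighbor.

```python
# Minimal behavioral model of one weight-stationary PE (names are
# illustrative, not the RTL's ports).
class WeightStationaryPE:
    def __init__(self):
        self.weight = 0  # loaded once, then fixed

    def load_weight(self, w):
        self.weight = w

    def step(self, act_in, psum_in):
        """One clock cycle: returns (act_out, psum_out)."""
        psum_out = psum_in + act_in * self.weight
        act_out = act_in  # activation passes through to the next column
        return act_out, psum_out

# A single column of PEs computes one dot product as activations
# stream past the fixed weights (cycle timing collapsed here).
pes = [WeightStationaryPE() for _ in range(4)]
weights = [2, -1, 3, 5]
acts = [1, 4, 2, 0]
for pe, w in zip(pes, weights):
    pe.load_weight(w)

psum = 0
for pe, a in zip(pes, acts):
    _, psum = pe.step(a, psum)

print(psum)  # 4  (1*2 + 4*-1 + 2*3 + 0*5)
```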
- Format: 1 sign bit + 7 integer bits + 8 fractional bits = 16 bits total
- Range: [-128.0, +127.996]
- Resolution: 1/256 ≈ 0.0039
- Multiplication: 16×16 → 32-bit product, arithmetic right shift by 8
- Accumulator: 32-bit to prevent overflow during dot products
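These Q8.8 rules can be checked with a short Python sketch (helper names are mine, not part of the repo's scripts; Python's `>>` on negative integers behaves as an arithmetic right shift):

```python
# Q8.8 helpers: 16-bit signed values with the low 8 bits fractional.
# Multiplication produces a 32-bit product, then an arithmetic right
# shift by 8 returns the result to Q8.8.
FRAC_BITS = 8
SCALE = 1 << FRAC_BITS                     # 256
Q_MIN, Q_MAX = -(1 << 15), (1 << 15) - 1   # 16-bit signed range

def to_q88(x: float) -> int:
    """Quantize a float to Q8.8 with saturation."""
    q = int(round(x * SCALE))
    return max(Q_MIN, min(Q_MAX, q))

def from_q88(q: int) -> float:
    return q / SCALE

def q88_mul(a: int, b: int) -> int:
    """16x16 -> 32-bit product, arithmetic right shift by 8."""
    return (a * b) >> FRAC_BITS  # Python's >> is arithmetic for ints

a, b = to_q88(1.5), to_q88(-2.25)
print(from_q88(q88_mul(a, b)))   # -3.375
print(from_q88(Q_MAX))           # 127.99609375 (top of the Q8.8 range)
```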
Matrices larger than the 4×4 array are decomposed into tiles:
```
Matrix C[M×N] = A[M×K] × B[K×N]

For each output tile (mt, nt):
    Clear accumulators
    For each K tile (kt):
        Feed A_tile and B_tile through array with skewing
        Accumulate partial results
    Read and store results
```
Data skewing ensures correct timing: row i is delayed by i cycles and column j by j cycles, so activations and downward-flowing partial sums meet at each PE in the same cycle. Total compute cycles per N×N tile: 3N-2.
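A pure-Python cycle-level model reproduces both the result and the 3N-2 timing (a sketch; register and signal names are illustrative, not the RTL's):

```python
# Cycle-level model of the skewed, weight-stationary array computing
# C = A @ W. Row i of the activation stream enters i cycles late,
# partial sums flow downward, and column j's results emerge from the
# bottom starting at cycle N-1+j, so an N x N tile takes 3N-2 cycles.
def systolic_matmul(A, W):
    N = len(W)
    act = [[0] * N for _ in range(N)]    # per-PE activation registers
    psum = [[0] * N for _ in range(N)]   # per-PE partial-sum registers
    C = [[0] * N for _ in range(N)]
    for t in range(3 * N - 2):           # total compute cycles: 3N-2
        new_act = [[0] * N for _ in range(N)]
        new_psum = [[0] * N for _ in range(N)]
        for i in range(N):
            for j in range(N):
                if j == 0:
                    m = t - i            # skew: row i delayed by i cycles
                    a_in = A[m][i] if 0 <= m < N else 0
                else:
                    a_in = act[i][j - 1]
                p_in = psum[i - 1][j] if i > 0 else 0
                new_act[i][j] = a_in
                new_psum[i][j] = p_in + a_in * W[i][j]
        act, psum = new_act, new_psum
        for j in range(N):               # collect finished column outputs
            m = t - (N - 1) - j
            if 0 <= m < N:
                C[m][j] = psum[N - 1][j]
    return C

def matmul_ref(A, W):
    N = len(W)
    return [[sum(A[m][k] * W[k][j] for k in range(N)) for j in range(N)]
            for m in range(N)]

A = [[1, 2, 3, 4], [0, -1, 2, 1], [3, 0, 0, 2], [1, 1, 1, 1]]
W = [[2, 0, 1, 0], [1, 1, 0, 0], [0, 2, 0, 1], [1, 0, 0, 2]]
assert systolic_matmul(A, W) == matmul_ref(A, W)
```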
Input (28×28 = 784 pixels)
│
▼
┌───────────────┐
│ FC Layer 1 │ 784 → 128 (100,352 weights)
│ + ReLU │
└───────┬───────┘
│
▼
┌───────────────┐
│ FC Layer 2 │ 128 → 64 (8,192 weights)
│ + ReLU │
└───────┬───────┘
│
▼
┌───────────────┐
│ FC Layer 3 │ 64 → 10 (640 weights)
│ (logits) │
└───────┬───────┘
│
▼
argmax → digit (0-9)
Trained in PyTorch for 15 epochs with the Adam optimizer. No bias terms — weights only.
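For reference, the network's forward pass can be sketched in plain Python; the random weights below are placeholders for the trained model (which `scripts/train_mnist.py` produces), so only shapes and structure are meaningful here:

```python
# Float sketch of the 3-layer inference network: fully-connected
# layers with no bias, ReLU between layers, argmax over 10 logits.
import random

def relu(v):
    return [x if x > 0.0 else 0.0 for x in v]

def fc(x, W):  # W has shape [len(x)][n_out]
    n_out = len(W[0])
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(n_out)]

def predict(pixels, W1, W2, W3):
    h1 = relu(fc(pixels, W1))   # 784 -> 128
    h2 = relu(fc(h1, W2))       # 128 -> 64
    logits = fc(h2, W3)         # 64 -> 10
    return max(range(10), key=lambda d: logits[d])

# Shape check with random placeholder weights (not the trained model)
rnd = random.Random(0)
mk = lambda r, c: [[rnd.uniform(-0.1, 0.1) for _ in range(c)] for _ in range(r)]
W1, W2, W3 = mk(784, 128), mk(128, 64), mk(64, 10)
digit = predict([0.5] * 784, W1, W2, W3)
print(digit)  # a digit in 0..9
```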
```
nn-accelerator/
├── rtl/                      # Verilog RTL design
│   ├── mac_pe.v              # MAC processing element (16-bit Q8.8)
│   ├── systolic_array.v      # 4×4 PE grid with generate blocks
│   ├── relu.v                # Combinational ReLU activation
│   ├── weight_buffer.v       # Weight SRAM model
│   ├── input_buffer.v        # Input activation SRAM model
│   ├── output_buffer.v       # Result SRAM model
│   ├── control_unit.v        # FSM controller with tiling logic
│   └── accelerator_top.v     # Top-level module integration
├── tb/                       # Verilator C++ testbenches
│   ├── tb_mac_pe.cpp         # PE unit tests (12 tests)
│   ├── tb_relu.cpp           # ReLU unit tests (13 tests)
│   ├── tb_systolic.cpp       # Array matrix multiply tests (5 tests)
│   ├── tb_accelerator.cpp    # Full system tests with tiling (7 tests)
│   └── tb_mnist.cpp          # MNIST inference (100 images)
├── scripts/                  # Python utilities
│   ├── train_mnist.py        # PyTorch MNIST training
│   ├── quantize_weights.py   # Q8.8 quantization + accuracy validation
│   ├── gen_test_vectors.py   # Test matrix generation
│   └── validate_results.py   # Result validation
├── sim/                      # Simulation build
│   └── Makefile              # Verilator build targets
├── data/                     # Generated data (not in git)
│   ├── mnist_weights/        # Quantized weight hex files
│   └── mnist_test/           # Test image hex files
├── docs/                     # Documentation
│   ├── architecture.md       # Detailed module specifications
│   └── performance.md        # Benchmarks and analysis
└── README.md
```
```sh
brew install verilator      # Verilator 5.x
python3 -m venv .venv       # Python virtual environment
source .venv/bin/activate
pip install torch torchvision numpy
cd sim
```
```sh
# Unit tests (fast)
make test_mac_pe        # MAC PE: 12 tests
make test_relu          # ReLU: 13 tests
make test_systolic      # Systolic array: 5 tests
make test_accelerator   # Full system: 7 tests

# MNIST inference (requires trained model)
cd .. && source .venv/bin/activate
python3 scripts/train_mnist.py       # Train model (~97% accuracy)
python3 scripts/quantize_weights.py  # Quantize to Q8.8, export hex
cd sim && make test_mnist            # Run inference on RTL

# Waveforms
make test_mac_pe        # Generates mac_pe.vcd
make waves_mac_pe       # Opens in GTKWave
```

Row-stationary (Eyeriss) optimizes data reuse across weights, activations, and partial sums, achieving better energy efficiency for convolutions. However, it requires significantly more complex control logic (per-PE control, multicast networks, global buffer management). For our fully-connected MNIST workload, weight-stationary provides equivalent computational efficiency with much simpler hardware: a single broadcast of activations across rows, and sequential weight loading through columns.
Q4.12 offers 16× finer fractional resolution but limits the integer range to [-8, +8). Neural network activations after ReLU can exceed this range, especially in deeper layers. Q8.8 provides [-128, +128) integer range with 1/256 fractional resolution, which is more than sufficient for MNIST — as demonstrated by only 0.1% accuracy drop from float32.
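The trade-off can be made concrete with a small helper (hypothetical, not part of the repo's scripts) that computes the range and resolution of any 16-bit Qm.n split, counting the sign bit in m as the document does:

```python
# Range/resolution for a 16-bit Qm.n two's-complement format
# (m + n = 16, sign bit counted in m):
#   range = [-2^(m-1), 2^(m-1) - 2^-n], resolution = 2^-n
def q_range(m, n):
    lo = -(2 ** (m - 1))
    hi = 2 ** (m - 1) - 2 ** -n
    return lo, hi, 2 ** -n

print(q_range(8, 8))    # (-128, 127.99609375, 0.00390625)
print(q_range(4, 12))   # (-8, 7.999755859375, 0.000244140625)
```

Note that Q4.12's resolution (2^-12) is 16× finer than Q8.8's (2^-8), at the cost of a 16× smaller integer range.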
Omitting bias simplifies the hardware (no additional adder stage) while having minimal impact on accuracy for MNIST. The model learns compensating weight distributions during training. This is a common approach in hardware-oriented neural network design.
When matrix dimensions exceed the array size, we tile the computation:
- K-dimension tiles accumulate partial sums (accumulator NOT cleared between tiles)
- M/N-dimension tiles are independent output blocks (accumulator cleared)
- Zero-padding for dimensions not divisible by ARRAY_SIZE
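A minimal software model of this tiling scheme (the loop structure mirrors the control unit's, but function names and the padded-read helper are illustrative):

```python
T = 4  # ARRAY_SIZE

def at(M, r, c):
    """Zero-padded read: elements outside the matrix read as 0."""
    return M[r][c] if r < len(M) and c < len(M[0]) else 0

def tiled_matmul(A, B):
    Mr, K, Nc = len(A), len(B), len(B[0])
    C = [[0] * Nc for _ in range(Mr)]
    for mt in range(0, Mr, T):                 # independent output blocks
        for nt in range(0, Nc, T):
            acc = [[0] * T for _ in range(T)]  # clear accumulators
            for kt in range(0, K, T):          # K tiles accumulate (no clear)
                for i in range(T):
                    for j in range(T):
                        acc[i][j] += sum(at(A, mt + i, kt + k)
                                         * at(B, kt + k, nt + j)
                                         for k in range(T))
            for i in range(T):                 # write back in-range results
                for j in range(T):
                    if mt + i < Mr and nt + j < Nc:
                        C[mt + i][nt + j] = acc[i][j]
    return C

# Ragged 5x6 @ 6x7 case exercises zero-padding on every dimension
A = [[(i * 7 + j) % 5 - 2 for j in range(6)] for i in range(5)]
B = [[(i * 3 + j) % 7 - 3 for j in range(7)] for i in range(6)]
ref = [[sum(A[i][k] * B[k][j] for k in range(6)) for j in range(7)]
       for i in range(5)]
assert tiled_matmul(A, B) == ref
```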
| Layer | Parameters | K Tiles | Cycles (4×4) | Cycles (8×8 est.) |
|---|---|---|---|---|
| FC1 | 100,352 | 196 | ~251K | ~63K |
| FC2 | 8,192 | 32 | ~20K | ~5K |
| FC3 | 640 | 16 | ~1.6K | ~0.4K |
| Total | 109,184 | 244 | ~273K | ~68K |
See docs/performance.md for detailed analysis.
- HDL: Verilog (Verilator-compatible subset)
- Simulator: Verilator 5.046 (compiled C++ simulation)
- Testbench: C++ (Verilator native API)
- Training: Python 3 + PyTorch
- Build: Make
- Platform: macOS (Apple Silicon M-series)
- Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit" (Google TPU v1)
- Chen et al., "Eyeriss: An Energy-Efficient Reconfigurable Accelerator" (Row-stationary dataflow)
- Project F, "Fixed-Point Numbers in Verilog" (Q-format implementation)