A custom neural network accelerator built from scratch in Verilog, featuring a weight-stationary systolic array that performs matrix multiplication for neural network inference. The design runs MNIST handwritten digit classification entirely in RTL simulation, achieving 97.5% accuracy with Q8.8 fixed-point arithmetic.
| Metric | Value |
|---|---|
| Float baseline accuracy | 97.6% |
| Q8.8 quantized accuracy | 97.5% (0.1% drop) |
| RTL simulation accuracy | 100% (100/100 test images) |
| Cycles per inference (4×4) | ~273K |
| Total parameters | 109,184 |
| Array architecture | 4×4 weight-stationary systolic |
| Arithmetic | Q8.8 signed fixed-point (16-bit) |
Weight Data (top)
┌───┬───┬───┬───┐
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
┌─────┬───┬───┬───┬───┐
Activation ──►│ │PE │PE │PE │PE │──► (unused)
Data ──►│ │00 │01 │02 │03 │
(left) ──►│ ├───┼───┼───┼───┤
──►│ │PE │PE │PE │PE │──►
│Array│10 │11 │12 │13 │
│ ├───┼───┼───┼───┤
│ │PE │PE │PE │PE │──►
│ │20 │21 │22 │23 │
│ ├───┼───┼───┼───┤
│ │PE │PE │PE │PE │──►
│ │30 │31 │32 │33 │
└─────┴───┴───┴───┴───┘
│ │ │ │
▼ ▼ ▼ ▼
Accumulator Outputs (bottom)
│
▼
┌────────┐
│ ReLU │
└────┬───┘
│
▼
┌──────────────┐
│Output Buffer │
└──────────────┘
In weight-stationary mode, weights are pre-loaded into PEs and remain fixed while activations stream through. This is the same dataflow used in the Google TPU v1.
Why weight-stationary?
- Higher PE utilization than output-stationary (~100% vs ~50%)
- Simpler control logic than row-stationary (Eyeriss)
- Natural fit for fully-connected layers where weights are reused across batches
- Proven at scale by Google's 256×256 TPU
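The weight-stationary dataflow can be sketched behaviorally in a few lines of Python (class and signal names below are illustrative, not the RTL module's ports): the weight is loaded once, then each cycle the PE multiplies the incoming activation by its resident weight, adds the partial sum from above, and forwards the activation to its right-hand neighbor.

```python
# Minimal behavioral model of one weight-stationary PE (names are
# illustrative, not the RTL's ports).
class WeightStationaryPE:
    def __init__(self):
        self.weight = 0  # loaded once, then fixed

    def load_weight(self, w):
        self.weight = w

    def step(self, act_in, psum_in):
        """One clock cycle: returns (act_out, psum_out)."""
        psum_out = psum_in + act_in * self.weight
        act_out = act_in  # activation passes through to the next column
        return act_out, psum_out

# A single column of PEs computes one dot product as activations
# stream past the fixed weights (cycle timing collapsed here).
pes = [WeightStationaryPE() for _ in range(4)]
weights = [2, -1, 3, 5]
acts = [1, 4, 2, 0]
for pe, w in zip(pes, weights):
    pe.load_weight(w)

psum = 0
for pe, a in zip(pes, acts):
    _, psum = pe.step(a, psum)

print(psum)  # 4  (1*2 + 4*-1 + 2*3 + 0*5)
```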
- Format: 1 sign bit + 7 integer bits + 8 fractional bits = 16 bits total
- Range: [-128.0, +127.996]
- Resolution: 1/256 ≈ 0.0039
- Multiplication: 16×16 → 32-bit product, arithmetic right shift by 8
- Accumulator: 32-bit to prevent overflow during dot products
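These Q8.8 rules can be checked with a short Python sketch (helper names are mine, not part of the repo's scripts; Python's `>>` on negative integers behaves as an arithmetic right shift):

```python
# Q8.8 helpers: 16-bit signed values with the low 8 bits fractional.
# Multiplication produces a 32-bit product, then an arithmetic right
# shift by 8 returns the result to Q8.8.
FRAC_BITS = 8
SCALE = 1 << FRAC_BITS                     # 256
Q_MIN, Q_MAX = -(1 << 15), (1 << 15) - 1   # 16-bit signed range

def to_q88(x: float) -> int:
    """Quantize a float to Q8.8 with saturation."""
    q = int(round(x * SCALE))
    return max(Q_MIN, min(Q_MAX, q))

def from_q88(q: int) -> float:
    return q / SCALE

def q88_mul(a: int, b: int) -> int:
    """16x16 -> 32-bit product, arithmetic right shift by 8."""
    return (a * b) >> FRAC_BITS  # Python's >> is arithmetic for ints

a, b = to_q88(1.5), to_q88(-2.25)
print(from_q88(q88_mul(a, b)))   # -3.375
print(from_q88(Q_MAX))           # 127.99609375 (top of the Q8.8 range)
```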
Matrices larger than the 4×4 array are decomposed into tiles:
```
Matrix C[M×N] = A[M×K] × B[K×N]

For each output tile (mt, nt):
    Clear accumulators
    For each K tile (kt):
        Feed A_tile and B_tile through array with skewing
        Accumulate partial results
    Read and store results
```
Data skewing ensures correct timing: row i is delayed by i cycles and column j by j cycles, so activations and downward-flowing partial sums meet at each PE in the same cycle. Total compute cycles per N×N tile: 3N-2.
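A pure-Python cycle-level model reproduces both the result and the 3N-2 timing (a sketch; register and signal names are illustrative, not the RTL's):

```python
# Cycle-level model of the skewed, weight-stationary array computing
# C = A @ W. Row i of the activation stream enters i cycles late,
# partial sums flow downward, and column j's results emerge from the
# bottom starting at cycle N-1+j, so an N x N tile takes 3N-2 cycles.
def systolic_matmul(A, W):
    N = len(W)
    act = [[0] * N for _ in range(N)]    # per-PE activation registers
    psum = [[0] * N for _ in range(N)]   # per-PE partial-sum registers
    C = [[0] * N for _ in range(N)]
    for t in range(3 * N - 2):           # total compute cycles: 3N-2
        new_act = [[0] * N for _ in range(N)]
        new_psum = [[0] * N for _ in range(N)]
        for i in range(N):
            for j in range(N):
                if j == 0:
                    m = t - i            # skew: row i delayed by i cycles
                    a_in = A[m][i] if 0 <= m < N else 0
                else:
                    a_in = act[i][j - 1]
                p_in = psum[i - 1][j] if i > 0 else 0
                new_act[i][j] = a_in
                new_psum[i][j] = p_in + a_in * W[i][j]
        act, psum = new_act, new_psum
        for j in range(N):               # collect finished column outputs
            m = t - (N - 1) - j
            if 0 <= m < N:
                C[m][j] = psum[N - 1][j]
    return C

def matmul_ref(A, W):
    N = len(W)
    return [[sum(A[m][k] * W[k][j] for k in range(N)) for j in range(N)]
            for m in range(N)]

A = [[1, 2, 3, 4], [0, -1, 2, 1], [3, 0, 0, 2], [1, 1, 1, 1]]
W = [[2, 0, 1, 0], [1, 1, 0, 0], [0, 2, 0, 1], [1, 0, 0, 2]]
assert systolic_matmul(A, W) == matmul_ref(A, W)
```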
Input (28×28 = 784 pixels)
│
▼
┌───────────────┐
│ FC Layer 1 │ 784 → 128 (100,352 weights)
│ + ReLU │
└───────┬───────┘
│
▼
┌───────────────┐
│ FC Layer 2 │ 128 → 64 (8,192 weights)
│ + ReLU │
└───────┬───────┘
│
▼
┌───────────────┐
│ FC Layer 3 │ 64 → 10 (640 weights)
│ (logits) │
└───────┬───────┘
│
▼
argmax → digit (0-9)
Trained in PyTorch for 15 epochs with the Adam optimizer. No bias terms — weights only.
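For reference, the network's forward pass can be sketched in plain Python; the random weights below are placeholders for the trained model (which `scripts/train_mnist.py` produces), so only shapes and structure are meaningful here:

```python
# Float sketch of the 3-layer inference network: fully-connected
# layers with no bias, ReLU between layers, argmax over 10 logits.
import random

def relu(v):
    return [x if x > 0.0 else 0.0 for x in v]

def fc(x, W):  # W has shape [len(x)][n_out]
    n_out = len(W[0])
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(n_out)]

def predict(pixels, W1, W2, W3):
    h1 = relu(fc(pixels, W1))   # 784 -> 128
    h2 = relu(fc(h1, W2))       # 128 -> 64
    logits = fc(h2, W3)         # 64 -> 10
    return max(range(10), key=lambda d: logits[d])

# Shape check with random placeholder weights (not the trained model)
rnd = random.Random(0)
mk = lambda r, c: [[rnd.uniform(-0.1, 0.1) for _ in range(c)] for _ in range(r)]
W1, W2, W3 = mk(784, 128), mk(128, 64), mk(64, 10)
digit = predict([0.5] * 784, W1, W2, W3)
print(digit)  # a digit in 0..9
```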
```
nn-accelerator/
├── rtl/                      # Verilog RTL design
│   ├── mac_pe.v              # MAC processing element (16-bit Q8.8)
│   ├── systolic_array.v      # 4×4 PE grid with generate blocks
│   ├── relu.v                # Combinational ReLU activation
│   ├── weight_buffer.v       # Weight SRAM model
│   ├── input_buffer.v        # Input activation SRAM model
│   ├── output_buffer.v       # Result SRAM model
│   ├── control_unit.v        # FSM controller with tiling logic
│   └── accelerator_top.v     # Top-level module integration
├── tb/                       # Verilator C++ testbenches
│   ├── tb_mac_pe.cpp         # PE unit tests (12 tests)
│   ├── tb_relu.cpp           # ReLU unit tests (13 tests)
│   ├── tb_systolic.cpp       # Array matrix multiply tests (5 tests)
│   ├── tb_accelerator.cpp    # Full system tests with tiling (7 tests)
│   └── tb_mnist.cpp          # MNIST inference (100 images)
├── scripts/                  # Python utilities
│   ├── train_mnist.py        # PyTorch MNIST training
│   ├── quantize_weights.py   # Q8.8 quantization + accuracy validation
│   ├── gen_test_vectors.py   # Test matrix generation
│   └── validate_results.py   # Result validation
├── sim/                      # Simulation build
│   └── Makefile              # Verilator build targets
├── data/                     # Generated data (not in git)
│   ├── mnist_weights/        # Quantized weight hex files
│   └── mnist_test/           # Test image hex files
├── docs/                     # Documentation
│   ├── architecture.md       # Detailed module specifications
│   └── performance.md        # Benchmarks and analysis
└── README.md
```
```sh
brew install verilator      # Verilator 5.x
python3 -m venv .venv       # Python virtual environment
source .venv/bin/activate
pip install torch torchvision numpy
cd sim
```
```sh
# Unit tests (fast)
make test_mac_pe        # MAC PE: 12 tests
make test_relu          # ReLU: 13 tests
make test_systolic      # Systolic array: 5 tests
make test_accelerator   # Full system: 7 tests

# MNIST inference (requires trained model)
cd .. && source .venv/bin/activate
python3 scripts/train_mnist.py       # Train model (~97% accuracy)
python3 scripts/quantize_weights.py  # Quantize to Q8.8, export hex
cd sim && make test_mnist            # Run inference on RTL

# Waveforms
make test_mac_pe        # Generates mac_pe.vcd
make waves_mac_pe       # Opens in GTKWave
```

Row-stationary (Eyeriss) optimizes data reuse across weights, activations, and partial sums, achieving better energy efficiency for convolutions. However, it requires significantly more complex control logic (per-PE control, multicast networks, global buffer management). For our fully-connected MNIST workload, weight-stationary provides equivalent computational efficiency with much simpler hardware: a single broadcast of activations across rows, and sequential weight loading through columns.
Q4.12 offers 16× finer fractional resolution but limits the integer range to [-8, +8). Neural network activations after ReLU can exceed this range, especially in deeper layers. Q8.8 provides [-128, +128) integer range with 1/256 fractional resolution, which is more than sufficient for MNIST — as demonstrated by only 0.1% accuracy drop from float32.
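The trade-off can be made concrete with a small helper (hypothetical, not part of the repo's scripts) that computes the range and resolution of any 16-bit Qm.n split, counting the sign bit in m as the document does:

```python
# Range/resolution for a 16-bit Qm.n two's-complement format
# (m + n = 16, sign bit counted in m):
#   range = [-2^(m-1), 2^(m-1) - 2^-n], resolution = 2^-n
def q_range(m, n):
    lo = -(2 ** (m - 1))
    hi = 2 ** (m - 1) - 2 ** -n
    return lo, hi, 2 ** -n

print(q_range(8, 8))    # (-128, 127.99609375, 0.00390625)
print(q_range(4, 12))   # (-8, 7.999755859375, 0.000244140625)
```

Note that Q4.12's resolution (2^-12) is 16× finer than Q8.8's (2^-8), at the cost of a 16× smaller integer range.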
Omitting bias simplifies the hardware (no additional adder stage) while having minimal impact on accuracy for MNIST. The model learns compensating weight distributions during training. This is a common approach in hardware-oriented neural network design.
When matrix dimensions exceed the array size, we tile the computation:
- K-dimension tiles accumulate partial sums (accumulator NOT cleared between tiles)
- M/N-dimension tiles are independent output blocks (accumulator cleared)
- Zero-padding for dimensions not divisible by ARRAY_SIZE
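A minimal software model of this tiling scheme (the loop structure mirrors the control unit's, but function names and the padded-read helper are illustrative):

```python
T = 4  # ARRAY_SIZE

def at(M, r, c):
    """Zero-padded read: elements outside the matrix read as 0."""
    return M[r][c] if r < len(M) and c < len(M[0]) else 0

def tiled_matmul(A, B):
    Mr, K, Nc = len(A), len(B), len(B[0])
    C = [[0] * Nc for _ in range(Mr)]
    for mt in range(0, Mr, T):                 # independent output blocks
        for nt in range(0, Nc, T):
            acc = [[0] * T for _ in range(T)]  # clear accumulators
            for kt in range(0, K, T):          # K tiles accumulate (no clear)
                for i in range(T):
                    for j in range(T):
                        acc[i][j] += sum(at(A, mt + i, kt + k)
                                         * at(B, kt + k, nt + j)
                                         for k in range(T))
            for i in range(T):                 # write back in-range results
                for j in range(T):
                    if mt + i < Mr and nt + j < Nc:
                        C[mt + i][nt + j] = acc[i][j]
    return C

# Ragged 5x6 @ 6x7 case exercises zero-padding on every dimension
A = [[(i * 7 + j) % 5 - 2 for j in range(6)] for i in range(5)]
B = [[(i * 3 + j) % 7 - 3 for j in range(7)] for i in range(6)]
ref = [[sum(A[i][k] * B[k][j] for k in range(6)) for j in range(7)]
       for i in range(5)]
assert tiled_matmul(A, B) == ref
```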
| Layer | Parameters | K Tiles | Cycles (4×4) | Cycles (8×8 est.) |
|---|---|---|---|---|
| FC1 | 100,352 | 196 | ~251K | ~63K |
| FC2 | 8,192 | 32 | ~20K | ~5K |
| FC3 | 640 | 16 | ~1.6K | ~0.4K |
| Total | 109,184 | 244 | ~273K | ~68K |
See docs/performance.md for detailed analysis.
- HDL: Verilog (Verilator-compatible subset)
- Simulator: Verilator 5.046 (compiled C++ simulation)
- Testbench: C++ (Verilator native API)
- Training: Python 3 + PyTorch
- Build: Make
- Platform: macOS (Apple Silicon M-series)
- Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit" (Google TPU v1)
- Chen et al., "Eyeriss: An Energy-Efficient Reconfigurable Accelerator" (Row-stationary dataflow)
- Project F, "Fixed-Point Numbers in Verilog" (Q-format implementation)