tacari/nn-accelerator

Neural Network Accelerator — Systolic Array in Verilog

A custom neural network accelerator built from scratch in Verilog, featuring a weight-stationary systolic array that performs matrix multiplication for neural network inference. The design runs MNIST handwritten digit classification entirely in RTL simulation, achieving 97.5% accuracy with Q8.8 fixed-point arithmetic.

Key Results

Metric                        Value
Float baseline accuracy       97.6%
Q8.8 quantized accuracy       97.5% (0.1% drop)
RTL simulation accuracy       100% (100/100 test images)
Cycles per inference (4×4)    ~273K
Total parameters              109,184
Array architecture            4×4 weight-stationary systolic
Arithmetic                    Q8.8 signed fixed-point (16-bit)

Architecture

                    Weight Data (top)
                    ┌───┬───┬───┬───┐
                    │   │   │   │   │
                    ▼   ▼   ▼   ▼
              ┌─────┬───┬───┬───┬───┐
Activation ──►│     │PE │PE │PE │PE │──► (unused)
  Data     ──►│     │00 │01 │02 │03 │
  (left)   ──►│     ├───┼───┼───┼───┤
           ──►│     │PE │PE │PE │PE │──►
              │Array│10 │11 │12 │13 │
              │     ├───┼───┼───┼───┤
              │     │PE │PE │PE │PE │──►
              │     │20 │21 │22 │23 │
              │     ├───┼───┼───┼───┤
              │     │PE │PE │PE │PE │──►
              │     │30 │31 │32 │33 │
              └─────┴───┴───┴───┴───┘
                    │   │   │   │
                    ▼   ▼   ▼   ▼
                 Accumulator Outputs (bottom)
                        │
                        ▼
                    ┌────────┐
                    │  ReLU  │
                    └────┬───┘
                         │
                         ▼
                  ┌──────────────┐
                  │Output Buffer │
                  └──────────────┘

Dataflow: Weight-Stationary

In weight-stationary mode, weights are pre-loaded into PEs and remain fixed while activations stream through. This is the same dataflow used in the Google TPU v1.

Why weight-stationary?

  • Higher PE utilization than output-stationary (~100% vs ~50%)
  • Simpler control logic than row-stationary (Eyeriss)
  • Natural fit for fully-connected layers where weights are reused across batches
  • Proven at scale by Google's 256×256 TPU
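The per-PE behavior behind this dataflow is small enough to capture in a few lines. Below is a behavioral Python sketch of one weight-stationary PE (port and method names are illustrative, not the RTL's):

```python
class WeightStationaryPE:
    """Behavioral model of one weight-stationary processing element.

    The weight is loaded once and held; each cycle the PE multiplies the
    incoming activation by the held weight, adds the partial sum arriving
    from the PE above, and forwards the activation to the PE on its right.
    """

    def __init__(self):
        self.weight = 0

    def load_weight(self, w):
        self.weight = w

    def step(self, act_in, psum_in):
        psum_out = psum_in + self.weight * act_in  # MAC
        act_out = act_in                           # activation passes rightward
        return act_out, psum_out

# One PE holding weight 3: activation 2 arrives with partial sum 10
pe = WeightStationaryPE()
pe.load_weight(3)
print(pe.step(2, 10))  # (2, 16)
```

An N×N grid of these, with partial sums chained top-to-bottom and activations chained left-to-right, is the whole array.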

Fixed-Point Arithmetic: Q8.8

  • Format: 1 sign bit + 7 integer bits + 8 fractional bits = 16 bits total
  • Range: [-128.0, +127.996]
  • Resolution: 1/256 ≈ 0.0039
  • Multiplication: 16×16 → 32-bit product, arithmetic right shift by 8
  • Accumulator: 32-bit to prevent overflow during dot products
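As a sanity check, the Q8.8 operations above can be mirrored in a short Python reference model (a sketch of the arithmetic only, not the RTL; Python's `>>` on negative integers is an arithmetic shift, matching the hardware):

```python
def to_q88(x: float) -> int:
    """Quantize a float to signed Q8.8, saturating to the 16-bit range."""
    q = int(round(x * 256))
    return max(-32768, min(32767, q))

def from_q88(q: int) -> float:
    """Convert a Q8.8 integer back to a float."""
    return q / 256.0

def q88_mul(a: int, b: int) -> int:
    """16×16 -> 32-bit product, then arithmetic right shift by 8 back to Q8.8."""
    return (a * b) >> 8

# Example: 1.5 * 2.25 = 3.375
a, b = to_q88(1.5), to_q88(2.25)
print(from_q88(q88_mul(a, b)))  # 3.375
```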

Tiling for Large Matrices

Matrices larger than the 4×4 array are decomposed into tiles:

Matrix C[M×N] = A[M×K] × B[K×N]

For each output tile (mt, nt):
  Clear accumulators
  For each K tile (kt):
    Feed A_tile and B_tile through array with skewing
    Accumulate partial results
  Read and store results

Data skewing ensures correct timing: row i is delayed by i cycles and column j by j cycles. Total compute cycles per tile: 3N-2, where N is the array size (10 cycles for the 4×4 array).
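The tiling loop above can be written as a plain-Python reference model (skew timing is a hardware concern and is omitted; the tile size T stands in for ARRAY_SIZE, and the min() bounds handle edge tiles that the RTL would zero-pad):

```python
def tiled_matmul(A, B, T=4):
    """C[M×N] = A[M×K] @ B[K×N], computed one T×T output tile at a time.

    K-dimension tiles accumulate into the same output tile;
    (mt, nt) output tiles are independent of each other.
    """
    M, K = len(A), len(A[0])
    N = len(B[0])
    C = [[0] * N for _ in range(M)]
    for mt in range(0, M, T):            # output tile rows
        for nt in range(0, N, T):        # output tile cols
            for kt in range(0, K, T):    # accumulate over K tiles
                for i in range(mt, min(mt + T, M)):
                    for j in range(nt, min(nt + T, N)):
                        for k in range(kt, min(kt + T, K)):
                            C[i][j] += A[i][k] * B[k][j]
    return C

# 2×3 times 3×2, tiled with T=2
print(tiled_matmul([[1, 2, 3], [4, 5, 6]],
                   [[7, 8], [9, 10], [11, 12]], T=2))  # [[58, 64], [139, 154]]
```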

MNIST Network Architecture

Input (28×28 = 784 pixels)
        │
        ▼
┌───────────────┐
│  FC Layer 1   │  784 → 128 (100,352 weights)
│    + ReLU     │
└───────┬───────┘
        │
        ▼
┌───────────────┐
│  FC Layer 2   │  128 → 64 (8,192 weights)
│    + ReLU     │
└───────┬───────┘
        │
        ▼
┌───────────────┐
│  FC Layer 3   │  64 → 10 (640 weights)
│   (logits)    │
└───────┬───────┘
        │
        ▼
    argmax → digit (0-9)

Trained in PyTorch for 15 epochs with Adam optimizer. No bias terms — weights only.
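For reference, a minimal PyTorch module matching the layer sizes above (the actual training script is scripts/train_mnist.py; anything beyond the stated widths, ReLU, and bias-free layers is not shown here):

```python
import torch.nn as nn

class MnistMLP(nn.Module):
    """784 -> 128 -> 64 -> 10 MLP with ReLU activations and no bias terms."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(784, 128, bias=False), nn.ReLU(),
            nn.Linear(128, 64, bias=False), nn.ReLU(),
            nn.Linear(64, 10, bias=False),   # logits; argmax gives the digit
        )

    def forward(self, x):
        return self.net(x)
```

Counting parameters recovers the totals in the table: 100,352 + 8,192 + 640 = 109,184.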

Project Structure

nn-accelerator/
├── rtl/                          # Verilog RTL design
│   ├── mac_pe.v                  # MAC processing element (16-bit Q8.8)
│   ├── systolic_array.v          # 4×4 PE grid with generate blocks
│   ├── relu.v                    # Combinational ReLU activation
│   ├── weight_buffer.v           # Weight SRAM model
│   ├── input_buffer.v            # Input activation SRAM model
│   ├── output_buffer.v           # Result SRAM model
│   ├── control_unit.v            # FSM controller with tiling logic
│   └── accelerator_top.v         # Top-level module integration
├── tb/                           # Verilator C++ testbenches
│   ├── tb_mac_pe.cpp             # PE unit tests (12 tests)
│   ├── tb_relu.cpp               # ReLU unit tests (13 tests)
│   ├── tb_systolic.cpp           # Array matrix multiply tests (5 tests)
│   ├── tb_accelerator.cpp        # Full system tests with tiling (7 tests)
│   └── tb_mnist.cpp              # MNIST inference (100 images)
├── scripts/                      # Python utilities
│   ├── train_mnist.py            # PyTorch MNIST training
│   ├── quantize_weights.py       # Q8.8 quantization + accuracy validation
│   ├── gen_test_vectors.py       # Test matrix generation
│   └── validate_results.py       # Result validation
├── sim/                          # Simulation build
│   └── Makefile                  # Verilator build targets
├── data/                         # Generated data (not in git)
│   ├── mnist_weights/            # Quantized weight hex files
│   └── mnist_test/               # Test image hex files
├── docs/                         # Documentation
│   ├── architecture.md           # Detailed module specifications
│   └── performance.md            # Benchmarks and analysis
└── README.md

Building and Running

Prerequisites

brew install verilator           # Verilator 5.x
python3 -m venv .venv            # Python virtual environment
source .venv/bin/activate
pip install torch torchvision numpy

Run All Tests

cd sim

# Unit tests (fast)
make test_mac_pe        # MAC PE: 12 tests
make test_relu          # ReLU: 13 tests
make test_systolic      # Systolic array: 5 tests
make test_accelerator   # Full system: 7 tests

# MNIST inference (requires trained model)
cd .. && source .venv/bin/activate
python3 scripts/train_mnist.py       # Train model (~97% accuracy)
python3 scripts/quantize_weights.py  # Quantize to Q8.8, export hex
cd sim && make test_mnist            # Run inference on RTL

View Waveforms

make test_mac_pe         # Generates mac_pe.vcd
make waves_mac_pe        # Opens in GTKWave

Design Decisions

Why Weight-Stationary over Row-Stationary?

Row-stationary (Eyeriss) optimizes data reuse across weights, activations, and partial sums, achieving better energy efficiency for convolutions. However, it requires significantly more complex control logic (per-PE control, multicast networks, global buffer management). For our fully-connected MNIST workload, weight-stationary provides equivalent computational efficiency with much simpler hardware — a single broadcast of activations across rows, and sequential weight loading through columns.

Why Q8.8 over Q4.12?

Q4.12 offers four extra fraction bits (16× finer resolution) but limits the integer range to [-8, +8). Neural network activations after ReLU can exceed this range, especially in deeper layers. Q8.8 provides a [-128, +128) integer range with 1/256 fractional resolution, which is more than sufficient for MNIST — as demonstrated by only a 0.1% accuracy drop from float32.
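The trade-off is easy to tabulate. A small helper computes range and resolution for any signed Q format (here int_bits counts the sign bit, so Q8.8 means int_bits=8, frac_bits=8):

```python
def q_format_stats(int_bits, frac_bits):
    """Return (min, max, resolution) of a signed fixed-point Q format.

    int_bits includes the sign bit; total width = int_bits + frac_bits.
    """
    scale = 1 << frac_bits
    lo = -(1 << (int_bits - 1))
    hi = (1 << (int_bits - 1)) - 1 / scale
    return lo, hi, 1 / scale

print(q_format_stats(8, 8))    # Q8.8:  (-128, 127.99609375, 0.00390625)
print(q_format_stats(4, 12))   # Q4.12: (-8, 7.999755859375, 0.000244140625)
```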

Why No Bias Terms?

Omitting bias simplifies the hardware (no additional adder stage) while having minimal impact on accuracy for MNIST. The model learns compensating weight distributions during training. This is a common approach in hardware-oriented neural network design.

Tiling Implementation

When matrix dimensions exceed the array size, we tile the computation:

  • K-dimension tiles accumulate partial sums (accumulator NOT cleared between tiles)
  • M/N-dimension tiles are independent output blocks (accumulator cleared)
  • Zero-padding for dimensions not divisible by ARRAY_SIZE
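The zero-padding step can be sketched as follows (an illustrative reference-model helper, not the RTL, which pads in hardware):

```python
def pad_to_multiple(mat, T=4):
    """Zero-pad a matrix so both dimensions are multiples of the array size T."""
    rows, cols = len(mat), len(mat[0])
    pad_rows = (-rows) % T                      # rows to add
    pad_cols = (-cols) % T                      # cols to add
    padded = [row + [0] * pad_cols for row in mat]
    padded += [[0] * (cols + pad_cols) for _ in range(pad_rows)]
    return padded

# A 1×3 matrix padded up to 4×4 for a 4×4 array
print(pad_to_multiple([[1, 2, 3]], T=4))
```

Padding with zeros is safe because the extra rows and columns contribute nothing to the accumulated dot products.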

Performance

Layer   Parameters   K Tiles   Cycles (4×4)   Cycles (8×8 est.)
FC1     100,352      196       ~251K          ~63K
FC2     8,192        32        ~20K           ~5K
FC3     640          16        ~1.6K          ~0.4K
Total   109,184                ~273K          ~68K

See docs/performance.md for detailed analysis.

Tech Stack

  • HDL: Verilog (Verilator-compatible subset)
  • Simulator: Verilator 5.046 (compiled C++ simulation)
  • Testbench: C++ (Verilator native API)
  • Training: Python 3 + PyTorch
  • Build: Make
  • Platform: macOS (Apple Silicon M-series)

References

  • Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit" (Google TPU v1)
  • Chen et al., "Eyeriss: An Energy-Efficient Reconfigurable Accelerator" (Row-stationary dataflow)
  • Project F, "Fixed-Point Numbers in Verilog" (Q-format implementation)
