A high-performance hardware accelerator designed for image convolution operations in Convolutional Neural Networks (CNNs). This implementation provides an efficient FPGA-based solution for accelerating the computationally intensive convolution operations commonly found in deep learning inference.
This accelerator is designed to perform 2D convolution operations on image data using dedicated hardware resources including BRAM buffers, matrix multiplication units, and optimized line buffer architectures. The design focuses on maximizing throughput while minimizing resource utilization for deployment on FPGA platforms.
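The 2D convolution the hardware computes can be captured by a small software reference model (a sketch, not the project's actual golden model; the function name is illustrative and "valid" padding with a 3x3 kernel is assumed):

```python
def conv2d_3x3(image, kernel):
    """Reference 3x3 convolution over a 2D list of pixels ("valid"
    padding): one output per fully covered 3x3 neighbourhood, which is
    what the accelerator emits once its pipeline has filled."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows - 2):
        out_row = []
        for c in range(cols - 2):
            acc = sum(image[r + i][c + j] * kernel[i][j]
                      for i in range(3) for j in range(3))
            out_row.append(acc)
        out.append(out_row)
    return out

identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]  # passes the centre pixel through
img = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
print(conv2d_3x3(img, identity))  # -> [[6, 7]]
```

A model like this is what a testbench compares hardware outputs against, pixel by pixel.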
- High-throughput convolution processing - Optimized for real-time image processing
- BRAM-based buffering - Efficient memory management using block RAM resources
- Pipelined architecture - Maximizes clock frequency and data throughput
- Configurable parameters - Supports various kernel sizes and input dimensions
- Line buffer optimization - Minimizes memory access patterns for improved performance
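The line-buffer idea behind the last feature can be sketched in software (my reading of a typical two-line-buffer scheme, not necessarily the exact logic in linebufferBRAM): two FIFOs, each one image row deep, hold the previous two rows, so a full 3x3 window is available for every incoming pixel after roughly a two-row warm-up.

```python
from collections import deque

def stream_3x3_windows(pixels, row_length):
    """Streaming sketch of a two-line-buffer sliding window: storage is
    only ~2*row_length pixels, yet every 3x3 neighbourhood is produced
    as pixels arrive in row-major order."""
    line1, line2 = deque(), deque()     # previous row, row before that
    win = [[0] * 3 for _ in range(3)]   # 3x3 window shift registers
    row = col = 0
    windows = []
    for px in pixels:
        mid = line1.popleft() if row >= 1 else 0   # pixel one row up
        top = line2.popleft() if row >= 2 else 0   # pixel two rows up
        line1.append(px)
        if row >= 1:
            line2.append(mid)
        for w, v in zip(win, (top, mid, px)):      # shift window left,
            w.pop(0)                               # insert new column
            w.append(v)
        if row >= 2 and col >= 2:                  # window fully valid
            windows.append([w[:] for w in win])
        col += 1
        if col == row_length:
            col, row = 0, row + 1
    return windows
```

In hardware the two FIFOs map naturally onto BRAM, which is why this structure minimizes memory accesses: each pixel is read from external memory exactly once.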
The accelerator consists of several key components:
- ImageConv: Top-level convolution engine that orchestrates the entire operation
- bramBuffer: BRAM-based buffer management for input/output data storage
- MatrixMult: Optimized matrix multiplication unit for convolution computation
- linebufferBRAM: Line buffer implementation using BRAM for sliding window operations
├── src/
│ ├── ImageConv/ # Top-level convolution accelerator module
│ ├── bramBuffer/ # BRAM buffer management components
│ ├── MatrixMult/ # Matrix multiplication engine
│ ├── linebufferBRAM/ # Line buffer implementation
│ └── ImageConvTB/ # Testbench for the accelerator
├── tests/ # Test vectors and validation scripts
└── README.md # This file
- FPGA development tools (Vivado, Quartus, or similar)
- HDL simulator (ModelSim, Vivado Simulator, etc.)
- Clone this repository:

```shell
git clone <repository-url>
cd image-convolution-accelerator
```
- Open your FPGA development environment and add all source files from the src/ directory.
- Set ImageConv as the top-level module for synthesis.
- Configure synthesis and implementation settings based on your target FPGA device.
- Set ImageConvTB as the top module for simulation
- Uncomment the desired test from lines 60-130 of the testbench
- Verify that all tests pass and inspect the generated waveforms for correctness
The accelerator supports several configurable parameters:
| Parameter | Description | Default Value |
|---|---|---|
| PIXEL_WIDTH | Bits required to represent a pixel | 8 |
| ROW_LENGTH | Input image width in pixels | 512 |
Modify these parameters in the top-level module to match your specific application requirements.
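These two parameters also determine the on-chip buffering cost. As a rough sizing sketch (assuming the common two-row line-buffer scheme for a 3x3 kernel; the actual storage in this design may differ):

```python
PIXEL_WIDTH = 8    # bits per pixel (default)
ROW_LENGTH = 512   # input image width in pixels (default)

# A 3x3 sliding window needs the two previous rows buffered on-chip,
# so line-buffer storage is roughly:
line_buffer_bits = 2 * ROW_LENGTH * PIXEL_WIDTH
print(line_buffer_bits)           # 8192 bits
print(line_buffer_bits / 1024)    # 8.0 kilobits
```

At the defaults this is only a few kilobits, so the line buffers fit comfortably in a small number of BRAM blocks; doubling ROW_LENGTH doubles the requirement.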
The accelerator has been optimized for the following performance characteristics:
- Throughput: Can achieve 5.74 GOPs (device dependent)
- Latency: 2583 clock cycles for first output (pipeline latency)
- Resource Usage: Optimized for minimal LUT and BRAM utilization
- Clock Frequency: Achieves at least 274 MHz on Arria 10 GX FPGAs
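One way to put the clock figure in perspective is a back-of-the-envelope frame-rate estimate (assumptions: one pixel accepted per cycle at steady state, a square 512x512 frame, and the 274 MHz clock quoted above):

```python
F_CLK = 274e6          # reported clock on Arria 10 GX
WIDTH = HEIGHT = 512   # ROW_LENGTH default, assuming a square frame

# If the pipeline sustains one pixel per cycle, a frame takes
# WIDTH * HEIGHT cycles (ignoring the small fill latency):
cycles_per_frame = WIDTH * HEIGHT
fps = F_CLK / cycles_per_frame
print(round(fps))  # ~1045 frames per second
```

Actual throughput depends on memory bandwidth and any stalls on the valid/ready handshake, so treat this as an upper bound.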
```verilog
// Instantiate the Image Convolution Accelerator
ImageConv #(
    .PIXEL_WIDTH(8),
    .ROW_LENGTH(512)
) conv_accelerator (
    .clk(clk),
    .reset(rst),
    .i_f(filter),       // 3x3 filter of 8-bit pixels
    .i_valid(i_valid),
    .i_ready(i_ready),
    .i_x(i_x),          // 8-bit input pixel
    .o_valid(o_valid),
    .o_ready(o_ready),
    .o_y(o_y)           // 8-bit output pixel
);
```
The project includes a comprehensive testbench located at src/ImageConvTB.v. The testbench verifies:
- Functional correctness against software reference models
- Edge case handling (boundary conditions, overflow, etc.)
- Performance benchmarks
- Originally designed for my Reconfigurable FPGA Architecture class; after later optimizing the design for performance and area, I no longer had access to a Quartus II license to re-measure the design on an Arria 10 GX FPGA
- Based on research in hardware acceleration for deep learning
- Optimized for modern FPGA architectures
- Inspired by state-of-the-art CNN acceleration techniques