This project implements high-performance single-precision matrix multiplication in NASM using SIMD instructions (xmm and ymm registers).
It is designed for benchmarking and understanding instruction-level parallelism, latency hiding, and floating-point throughput on modern x86-64 CPUs (tested on AMD Ryzen 5 5600X, Zen 3).
**Matrix Generation**

`matrix_generator.c` allocates and fills two square matrices (M1 and M2), exporting their base addresses to NASM.
**Main ASM Routine**

`SIMDMProduct.asm` loads these addresses, loops through rows and columns, and calls a specialized SIMD routine:

- `dot_product_xmm` → SSE 128-bit version
- `dot_product_ymm` → AVX 256-bit version
**Timing**

The assembly code uses `RDTSC` (Read Time-Stamp Counter) with proper serialization via `CPUID` to measure computation time precisely. Only the matrix multiplication region (no I/O, no allocation) is timed.
**Result Display**

The total time (in seconds) is stored in `xmm0` and printed after computation.
To switch between the SSE and AVX kernels, edit these lines in your main NASM file (`SIMDMProduct.asm`):

```nasm
extern dot_product_ymm
; or
; extern dot_product_xmm

call dot_product_ymm
; or
; call dot_product_xmm
```

| Version | Language | Register Width | Unrolling | Approx. Runtime (512×512) | GFLOP/s | CPI | Notes |
|---|---|---|---|---|---|---|---|
| Scalar C | C | None (1 float) | None | ~0.6 s | < 1 | >1.0 | Non-vectorized baseline (-O3). |
| Scalar C++ | C++ | None (1 float) | None | ~1.3 s | < 1 | >1.0 | Non-vectorized baseline (-O3). |
| XMM (SSE) | NASM | 128-bit (4 floats) | 2× | ~0.29 s | ~10 | 0.45 | Baseline SIMD throughput. |
| YMM (AVX) | NASM | 256-bit (8 floats) | 4× | ~0.0098 s | ~28 | 0.27 | Fully pipelined, high ILP. |
| Hybrid C + ASM (AVX) | C + ASM | 256-bit (8 floats) | 4× | ~0.010–0.020 s | ~25–28 | 0.27 | C driver, ASM compute kernel. |
| Hybrid C++ + ASM (AVX) | C++ + ASM | 256-bit (8 floats) | 4× | ~0.010–0.020 s | ~25–28 | 0.27 | C++ driver, ASM compute kernel. |
- CPU: AMD Ryzen 5 5600X (Zen 3, 6 cores / 12 threads)
- Clock: 3.69 GHz
- Compiler: `gcc -O3 -mavx -msse -mfma`
- Memory: 32 GB DDR4 3200 MHz
- OS: Ubuntu 22.04.5 LTS x86_64
- The SSE (xmm) version provides a 5–9× speedup vs pure C for moderate matrices.
- AVX (ymm) doubles the vector width; combined with deeper unrolling, it cuts runtime sharply for large matrices (512×512 and beyond).
- AVX-512 (zmm) would double the vector width again, but it demands careful memory alignment and is not available on Zen 3 CPUs such as the 5600X.
- For matrices beyond 1024×1024, performance becomes memory-bound rather than compute-bound.
- Aligned allocations and blocked computation (tiling) maximize cache reuse.