This project implements high-performance single-precision matrix multiplication in NASM using SIMD instructions (xmm and ymm registers).
It is designed for benchmarking and understanding instruction-level parallelism, latency hiding, and floating-point throughput on modern x86-64 CPUs (tested on AMD Ryzen 5 5600X, Zen 3).
**Matrix Generation**

`matrix_generator.c` allocates and fills two square matrices (M1 and M2), exporting their base addresses to NASM.
**Main ASM Routine**

`SIMDMProduct.asm` loads these addresses, loops through rows and columns, and calls a specialized SIMD routine:

- `dot_product_xmm` → SSE 128-bit version
- `dot_product_ymm` → AVX 256-bit version
**Timing**

The assembly code uses `RDTSC` (Read Time-Stamp Counter) with proper serialization via `CPUID` to measure computation time precisely. Only the matrix multiplication region (no I/O, no allocation) is timed.
**Result Display**

The total time (in seconds) is stored in `xmm0` and printed after computation.
To switch between the SSE and AVX kernels, edit these lines in your main NASM file (`SIMDMProduct.asm`):

```nasm
extern dot_product_ymm
; or
; extern dot_product_xmm

call dot_product_ymm
; or
; call dot_product_xmm
```

| Version | Language | Register Width | Unrolling | Approx. Runtime (512×512) | GFLOP/s | CPI | Notes |
|---|---|---|---|---|---|---|---|
| Scalar C | C | None (1 float) | None | ~0.6 s | < 1 | >1.0 | Non-vectorized baseline (-O3). |
| Scalar C++ | C++ | None (1 float) | None | ~1.3 s | < 1 | >1.0 | Non-vectorized baseline (-O3). |
| XMM (SSE) | NASM | 128-bit (4 floats) | 2× | ~0.29 s | ~10 | 0.45 | Baseline SIMD throughput. |
| YMM (AVX) | NASM | 256-bit (8 floats) | 4× | ~0.0098 s | ~28 | 0.27 | Fully pipelined, high ILP. |
| Hybrid C + ASM (AVX) | C + ASM | 256-bit (8 floats) | 4× | ~0.010–0.020 s | ~25–28 | 0.27 | C driver, ASM compute kernel. |
| Hybrid C++ + ASM (AVX) | C++ + ASM | 256-bit (8 floats) | 4× | ~0.010–0.020 s | ~25–28 | 0.27 | C++ driver, ASM compute kernel. |
- CPU: AMD Ryzen 5 5600X (Zen 3, 6 cores / 12 threads)
- Clock: 3.69 GHz
- Compiler: `gcc -O3 -mavx -msse -mfma`
- Memory: 32 GB DDR4 3200 MHz
- OS: Ubuntu 22.04.5 LTS x86_64
- The SSE (xmm) version provides a 5–9× speedup vs pure C for moderate matrices.
- AVX (ymm) doubles the vector width; combined with deeper unrolling, it cuts runtime sharply for large matrices (512×512 and beyond).
- AVX-512 (zmm) would double the vector width again, but it demands careful memory alignment and is not available on Zen 3 CPUs such as the 5600X.
- For matrices beyond 1024×1024, performance becomes memory-bound rather than compute-bound.
- Aligned allocations and blocked computation (tiling) maximize cache reuse.