
High-performance single-precision matrix multiplication benchmark using NASM with SIMD (SSE/AVX) instructions. Designed for instruction-level parallelism, latency hiding, and throughput evaluation on modern x86-64 CPUs.

tslime/SIMDMatrixAlgorithm

SIMDMatrixAlgorithm — Assembly-Level Matrix Multiplication Benchmark

This project implements high-performance single-precision matrix multiplication in NASM using SIMD instructions (xmm and ymm registers).
It is designed for benchmarking and understanding instruction-level parallelism, latency hiding, and floating-point throughput on modern x86-64 CPUs (tested on AMD Ryzen 5 5600X, Zen 3).

How It Works

  1. Matrix Generation
    matrix_generator.c allocates and fills two square matrices (M1, M2), exporting their base addresses to NASM.

  2. Main ASM Routine
    SIMDMProduct.asm loads these addresses, loops through rows and columns, and calls a specialized SIMD routine:

    • dot_product_xmm → SSE 128-bit version
    • dot_product_ymm → AVX 256-bit version
  3. Timing
    The assembly code uses RDTSC (Read Time Stamp Counter) with proper serialization via CPUID to measure computation time precisely. Only the matrix multiplication region (no I/O, no allocation) is timed.

  4. Result Display
    The total time (in seconds) is stored in xmm0 and printed after computation.
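The assembly kernels themselves are not reproduced here, but the 128-bit inner loop can be sketched in C with SSE intrinsics. This is a hedged approximation of what `dot_product_xmm` does; the function name, signature, and the assumption that `n` is a multiple of 4 are illustrative, not the project's actual interface:

```c
#include <immintrin.h>  /* SSE intrinsics (baseline on x86-64) */
#include <stddef.h>

/* Illustrative 128-bit dot product: 4 floats per iteration.
 * Assumes n is a multiple of 4 (no scalar tail) for brevity. */
static float dot_product_sse(const float *a, const float *b, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);           /* load 4 floats from each input */
        __m128 vb = _mm_loadu_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb)); /* 4 running partial sums */
    }
    /* Horizontal reduction of the 4 lanes (SSE1-only, no haddps) */
    __m128 shuf = _mm_shuffle_ps(acc, acc, _MM_SHUFFLE(1, 0, 3, 2));
    acc = _mm_add_ps(acc, shuf);   /* lanes now hold (0+2, 1+3, ...) */
    shuf = _mm_shuffle_ps(acc, acc, _MM_SHUFFLE(0, 1, 0, 1));
    acc = _mm_add_ss(acc, shuf);   /* lane 0 holds the full sum */
    return _mm_cvtss_f32(acc);
}
```

The YMM variant is the same pattern at 256-bit width (8 floats per iteration), which is where the roughly doubled per-instruction throughput in the table comes from.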
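The RDTSC/CPUID serialization described in step 3 looks roughly like this in C (a sketch using GCC/Clang builtins; the project issues these instructions directly in NASM):

```c
#include <stdint.h>
#include <cpuid.h>      /* __get_cpuid (GCC/Clang helper) */
#include <x86intrin.h>  /* __rdtsc, __rdtscp */

/* Serialized start read: CPUID acts as a barrier, so instructions
 * issued before this point retire before RDTSC samples the counter. */
static uint64_t tsc_begin(void)
{
    unsigned a, b, c, d;
    __get_cpuid(0, &a, &b, &c, &d);  /* serializing instruction */
    return __rdtsc();
}

/* Serialized end read: RDTSCP waits for prior instructions to finish,
 * and the trailing CPUID keeps later instructions from being hoisted
 * above the read. */
static uint64_t tsc_end(void)
{
    unsigned aux, a, b, c, d;
    uint64_t t = __rdtscp(&aux);
    __get_cpuid(0, &a, &b, &c, &d);
    return t;
}
```

Dividing the cycle delta by the nominal TSC frequency (3.69 GHz on the benchmark machine) gives seconds, which is the value stored in xmm0.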

Switching Between XMM and YMM Builds

Choose the SIMD width

Edit these lines in your main NASM file (SIMDMProduct.asm) to select the extern declaration and call for the desired width:

extern dot_product_ymm
; or
; extern dot_product_xmm
call dot_product_ymm
; or
; call dot_product_xmm

Performance Table

| Version | Language | Register Width | Unrolling | Approx. Runtime (512×512) | GFLOP/s | CPI | Notes |
|---|---|---|---|---|---|---|---|
| Scalar C | C | None (1 float) | None | ~0.6 s | < 1 | > 1.0 | Non-vectorized baseline (-O3). |
| Scalar C++ | C++ | None (1 float) | None | ~1.3 s | < 1 | > 1.0 | Non-vectorized baseline (-O3). |
| XMM (SSE) | NASM | 128-bit (4 floats) | — | ~0.29 s | ~10 | 0.45 | Baseline SIMD throughput. |
| YMM (AVX) | NASM | 256-bit (8 floats) | — | ~0.0098 s | ~28 | 0.27 | Fully pipelined, high ILP. |
| Hybrid C + ASM (AVX) | C + ASM | 256-bit (8 floats) | — | ~0.010–0.020 s | ~25–28 | 0.27 | C driver, ASM compute kernel. |
| Hybrid C++ + ASM (AVX) | C++ + ASM | 256-bit (8 floats) | — | ~0.010–0.020 s | ~25–28 | 0.27 | C++ driver, ASM compute kernel. |

Benchmark Environment

  • CPU: AMD Ryzen 5 5600X (Zen 3, 6 cores / 12 threads)
  • Clock: 3.69 GHz
  • Compiler: gcc -O3 -mavx -msse -mfma
  • Memory: 32 GB DDR4 3200 MHz
  • OS: Ubuntu 22.04.5 LTS x86_64

Observations

  • The SSE (xmm) version provides a 5–9× speedup vs pure C for moderate matrices.
  • AVX (ymm) doubles vector width, halving runtime for large matrices (512×512 and beyond).
  • AVX-512 (zmm) can deliver a further ~30× speedup, but it requires perfect memory alignment.
  • For matrices beyond 1024×1024, performance becomes memory-bound rather than compute-bound.
  • Aligned allocations and blocked computation (tiling) maximize cache reuse.
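The tiling observation can be sketched as a blocked triple loop in C. The tile size here is a tunable assumption, not a measured optimum; the point is that each TILE×TILE block of A and B is reused while still cache-resident:

```c
#include <string.h>

#define TILE 64  /* block edge; tune so tiles of A, B, C fit in L1/L2 */

/* Blocked (tiled) matmul: C = A·B, all n×n row-major.
 * The three outer loops walk tiles; the inner loops reuse the
 * current TILE×TILE blocks of A and B before moving on. */
static void matmul_tiled(const float *A, const float *B, float *C, int n)
{
    memset(C, 0, (size_t)n * n * sizeof(float));
    for (int ii = 0; ii < n; ii += TILE)
        for (int kk = 0; kk < n; kk += TILE)
            for (int jj = 0; jj < n; jj += TILE)
                for (int i = ii; i < ii + TILE && i < n; i++)
                    for (int k = kk; k < kk + TILE && k < n; k++) {
                        float aik = A[i * n + k];
                        for (int j = jj; j < jj + TILE && j < n; j++)
                            C[i * n + j] += aik * B[k * n + j];
                    }
}
```

The innermost loop runs over contiguous rows of B and C, so it vectorizes naturally with the XMM/YMM kernels described above.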
