A high-performance SIMD (Single Instruction, Multiple Data) library for Go providing vectorized operations on float64, float32, and complex128 slices.
- Pure Go assembly - Native Go assembler, simple cross-compilation
- Runtime CPU detection - Automatically selects optimal implementation (AVX-512, AVX+FMA, SSE2, NEON, or pure Go)
- Zero allocations - All operations work on pre-allocated slices
- 46 operations - Arithmetic, reduction, statistical, vector, signal processing, activation functions, and complex number operations
- Multi-architecture - AMD64 (AVX-512/AVX+FMA/SSE2) and ARM64 (NEON) with pure Go fallback
- Thread-safe - All functions are safe for concurrent use
```sh
go get github.com/tphakala/simd
```

Requires Go 1.25+.
```go
package main

import (
	"fmt"

	"github.com/tphakala/simd/cpu"
	"github.com/tphakala/simd/f64"
)

func main() {
	fmt.Println("SIMD:", cpu.Info())

	a := []float64{1, 2, 3, 4, 5, 6, 7, 8}
	b := []float64{8, 7, 6, 5, 4, 3, 2, 1}

	// Dot product
	dot := f64.DotProduct(a, b)
	fmt.Println("Dot product:", dot) // 120

	// Element-wise operations
	dst := make([]float64, len(a))
	f64.Add(dst, a, b)
	fmt.Println("Sum:", dst) // [9 9 9 9 9 9 9 9]

	// Statistical operations
	mean := f64.Mean(a)
	stddev := f64.StdDev(a)
	fmt.Printf("Mean: %.2f, StdDev: %.2f\n", mean, stddev)

	// Vector operations
	f64.Normalize(dst, a)
	fmt.Println("Normalized:", dst)

	// Distance calculation
	dist := f64.EuclideanDistance(a, b)
	fmt.Println("Distance:", dist)
}
```

```go
import "github.com/tphakala/simd/cpu"

fmt.Println(cpu.Info())      // "AMD64 AVX-512", "AMD64 AVX+FMA", "AMD64 SSE2", or "ARM64 NEON"
fmt.Println(cpu.HasAVX())    // true/false
fmt.Println(cpu.HasAVX512()) // true/false
fmt.Println(cpu.HasNEON())   // true/false
```

| Category | Function | Description | SIMD Width |
|---|---|---|---|
| Arithmetic | `Add(dst, a, b)` | Element-wise addition | 8x (AVX-512) / 4x (AVX) / 2x (NEON) |
| | `Sub(dst, a, b)` | Element-wise subtraction | 8x / 4x / 2x |
| | `Mul(dst, a, b)` | Element-wise multiplication | 8x / 4x / 2x |
| | `Div(dst, a, b)` | Element-wise division | 8x / 4x / 2x |
| | `Scale(dst, a, s)` | Multiply by scalar | 8x / 4x / 2x |
| | `AddScalar(dst, a, s)` | Add scalar | 8x / 4x / 2x |
| | `FMA(dst, a, b, c)` | Fused multiply-add: a*b+c | 8x / 4x / 2x |
| | `AddScaled(dst, alpha, s)` | dst += alpha*s (axpy) | 8x / 4x / 2x |
| Unary | `Abs(dst, a)` | Absolute value | 8x / 4x / 2x |
| | `Neg(dst, a)` | Negation | 8x / 4x / 2x |
| | `Sqrt(dst, a)` | Square root | 8x / 4x / 2x |
| | `Reciprocal(dst, a)` | Reciprocal (1/x) | 8x / 4x / 2x |
| Reduction | `DotProduct(a, b)` | Dot product | 8x / 4x / 2x |
| | `Sum(a)` | Sum of elements | 8x / 4x / 2x |
| | `Min(a)` | Minimum value | 8x / 4x / 2x |
| | `Max(a)` | Maximum value | 8x / 4x / 2x |
| | `MinIdx(a)` | Index of minimum value | Pure Go |
| | `MaxIdx(a)` | Index of maximum value | Pure Go |
| Statistical | `Mean(a)` | Arithmetic mean | 8x / 4x / 2x |
| | `Variance(a)` | Population variance | 8x / 4x / 2x |
| | `StdDev(a)` | Standard deviation | 8x / 4x / 2x |
| Vector | `EuclideanDistance(a, b)` | L2 distance | 8x / 4x / 2x |
| | `Normalize(dst, a)` | Unit vector normalization | 8x / 4x / 2x |
| | `CumulativeSum(dst, a)` | Running sum | Sequential |
| Range | `Clamp(dst, a, min, max)` | Clamp to range | 8x / 4x / 2x |
| Activation | `Sigmoid(dst, src)` | Sigmoid: 1/(1+e^-x) | 4x (AVX) / 2x (NEON) |
| | `ReLU(dst, src)` | Rectified Linear Unit | 8x / 4x / 2x |
| | `Tanh(dst, src)` | Hyperbolic tangent | 8x / 4x / 2x |
| | `Exp(dst, src)` | Exponential e^x | Pure Go |
| | `ClampScale(dst, src, min, max, s)` | Fused clamp and scale | 8x / 4x / 2x |
| Batch | `DotProductBatch(r, rows, v)` | Multiple dot products | 8x / 4x / 2x |
| Signal | `ConvolveValid(dst, sig, k)` | FIR filter / convolution | 8x / 4x / 2x |
| | `ConvolveValidMulti(dsts, sig, ks)` | Multi-kernel convolution | 8x / 4x / 2x |
| | `AccumulateAdd(dst, src, off)` | Overlap-add: dst[off:] += src | 8x / 4x / 2x |
| Audio | `Interleave2(dst, a, b)` | Pack stereo: [L,R,L,R,...] | 4x / 2x |
| | `Deinterleave2(a, b, src)` | Unpack stereo to channels | 4x / 2x |
| | `CubicInterpDot(hist, a, b, c, d, x)` | Fused cubic interp dot product | 4x / 2x |
| | `Int32ToFloat32Scale(dst, src, s)` | PCM int32 to normalized float | 8x / 4x |
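The fused operations chain naturally; a minimal sketch with signatures taken from the table above (expected outputs follow from the stated semantics; aliasing of `dst` with an input is not assumed):

```go
package main

import (
	"fmt"

	"github.com/tphakala/simd/f64"
)

func main() {
	a := []float64{1, 2, 3, 4}
	b := []float64{5, 6, 7, 8}
	c := []float64{1, 1, 1, 1}

	// FMA: dst[i] = a[i]*b[i] + c[i], per the table above.
	dst := make([]float64, len(a))
	f64.FMA(dst, a, b, c)
	fmt.Println(dst) // [6 13 22 33]

	// AddScaled (axpy): dst[i] += alpha * s[i].
	f64.AddScaled(dst, 2, a)
	fmt.Println(dst) // [8 17 28 41]

	// Clamp each element into [0, 20]; a separate output slice avoids
	// assuming anything about in-place use.
	clamped := make([]float64, len(dst))
	f64.Clamp(clamped, dst, 0, 20)
	fmt.Println(clamped) // [8 17 20 20]
}
```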
Same API as f64 but for float32 with wider SIMD:
| Architecture | SIMD Width |
|---|---|
| AMD64 (AVX-512) | 16x float32 |
| AMD64 (AVX+FMA) | 8x float32 |
| AMD64 (SSE2) | 4x float32 |
| ARM64 (NEON) | 4x float32 |
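A minimal sketch, relying on the statement above that f32 mirrors the f64 call shapes:

```go
package main

import (
	"fmt"

	"github.com/tphakala/simd/f32"
)

func main() {
	a := []float32{1, 2, 3, 4}
	b := []float32{4, 3, 2, 1}

	// Same call shapes as f64, but each AVX-512 register holds
	// 16 float32 lanes instead of 8 float64 lanes.
	dot := f32.DotProduct(a, b)
	fmt.Println(dot) // 20

	dst := make([]float32, len(a))
	f32.Mul(dst, a, b)
	fmt.Println(dst) // [4 6 6 4]
}
```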
SIMD-accelerated complex number operations for FFT-based signal processing:
| Category | Function | Description | SIMD Width |
|---|---|---|---|
| Arithmetic | `Mul(dst, a, b)` | Complex multiplication | 4x (AVX-512) / 2x (AVX) |
| | `MulConj(dst, a, b)` | Multiply by conjugate: a × conj(b) | 4x / 2x |
| | `Scale(dst, a, s)` | Scale by complex scalar | 4x / 2x |
| | `Add(dst, a, b)` | Complex addition | 4x / 2x |
| | `Sub(dst, a, b)` | Complex subtraction | 4x / 2x |
| Unary | `Abs(dst, a)` | Complex magnitude \|a + bi\| | 4x (AVX-512) / 2x (AVX) |
| | `AbsSq(dst, a)` | Magnitude squared \|a + bi\|² | 4x / 2x |
| | `Conj(dst, a)` | Complex conjugate: a - bi | 4x / 2x |
These operations are designed for FFT-based signal processing pipelines:
import "github.com/tphakala/simd/c128"
// Frequency-domain multiplication (FFT convolution)
signalFFT := make([]complex128, n)
kernelFFT := make([]complex128, n)
result := make([]complex128, n)
magnitude := make([]float64, n)
// Frequency-domain filtering
c128.Mul(result, signalFFT, kernelFFT) // Complex multiply
c128.MulConj(result, signalFFT, kernelFFT) // Cross-correlation
// Spectrogram and magnitude analysis
c128.Abs(magnitude, signalFFT) // Extract magnitude for displayUse Cases:
- Abs/AbsSq: Spectrograms, power spectral density, frequency analysis
- Conj: Cross-correlation, frequency-domain filtering
- Mul/MulConj: FFT-based convolution, filtering, correlation
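For instance, the power-spectral-density use case reduces to a single `AbsSq` pass over an FFT frame; a sketch, assuming `AbsSq` writes |z|² into a float64 destination in the same way `Abs` writes magnitudes above:

```go
package main

import (
	"fmt"

	"github.com/tphakala/simd/c128"
)

func main() {
	// One FFT frame (would normally come from an FFT of a windowed signal).
	frame := []complex128{3 + 4i, 1 + 0i, 0 + 2i}

	// Power spectrum: |z|² per bin, computed in one vectorized pass.
	power := make([]float64, len(frame))
	c128.AbsSq(power, frame)
	fmt.Println(power) // [25 1 4]
}
```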
c128 benchmark (1024 elements, Intel i7-1260P, AVX+FMA):
| Operation | SIMD | Pure Go | Speedup |
|---|---|---|---|
| Mul | 341 ns | 757 ns | 2.2x |
| MulConj | 340 ns | 749 ns | 2.2x |
| Scale | 253 ns | 551 ns | 2.2x |
| Add | 86 ns | 189 ns | 2.2x |
| Abs | 1326 ns | 2260 ns | 1.7x |
| AbsSq | 367 ns | 504 ns | 1.37x |
| Conj | 304 ns | 474 ns | 1.56x |
f64 benchmarks:
| Category | Operation | SIMD (ns) | Go (ns) | Speedup |
|---|---|---|---|---|
| Arithmetic | Add | 84 | 446 | 5.3x |
| | Sub | 84 | 335 | 4.0x |
| | Mul | 86 | 436 | 5.1x |
| | Div | 441 | 941 | 2.1x |
| | Scale | 68 | 272 | 4.0x |
| | AddScalar | 68 | 286 | 4.2x |
| | FMA | 110 | 557 | 5.0x |
| Unary | Abs | 66 | 365 | 5.6x |
| | Neg | 66 | 306 | 4.6x |
| | Sqrt | 658 | 1323 | 2.0x |
| | Reciprocal | 447 | 920 | 2.1x |
| Reduction | DotProduct | 162 | 859 | 5.3x |
| | Sum | 82 | 184 | 2.3x |
| | Min | 157 | 340 | 2.2x |
| | Max | 154 | 352 | 2.3x |
| Statistical | Mean | 82 | 184 | 2.3x |
| | Variance* | 820 | 902 | 1.1x |
| | StdDev* | 825 | 905 | 1.1x |
| Vector | EuclideanDistance | 216 | 1071 | 5.0x |
| | Normalize | 220 | 1080 | 4.9x |
| | CumulativeSum | 428 | 425 | 1.0x |
| Range | Clamp | 81 | 640 | 7.9x |
*Variance/StdDev benchmarked at 4096 elements (SIMD benefits at larger sizes)
f32 benchmarks:
| Category | Operation | SIMD (ns) | Go (ns) | Speedup |
|---|---|---|---|---|
| Arithmetic | Add | 47 | 441 | 9.4x |
| | Sub | 49 | 339 | 6.9x |
| | Mul | 49 | 436 | 8.9x |
| | Div | 138 | 655 | 4.8x |
| | Scale | 40 | 299 | 7.4x |
| | AddScalar | 39 | 272 | 7.0x |
| | FMA | 64 | 444 | 6.9x |
| Unary | Abs | 37 | 656 | 17.6x |
| | Neg | 40 | 273 | 6.9x |
| Reduction | DotProduct | 71 | 424 | 5.9x |
| | Sum | 41 | 123 | 3.0x |
| | Min | 65 | 340 | 5.2x |
| | Max | 66 | 352 | 5.3x |
| Range | Clamp | 47 | 701 | 14.8x |
Activation functions, float32 (1024 elements):
| Function | SIMD (ns) | Go (ns) | Speedup | SIMD Throughput |
|---|---|---|---|---|
| Sigmoid | 138 | 5906 | 43x | 59.3 GB/s |
| ReLU | 39 | 662 | 17x | 211 GB/s |
| Tanh | 138 | 28116 | 204x | 59.5 GB/s |
Activation functions, float64 (1024 elements):
| Function | SIMD (ns) | Go (ns) | Speedup | SIMD Throughput |
|---|---|---|---|---|
| ReLU | 68 | 646 | 9.5x | 240 GB/s |
| Tanh | 445 | 6230 | 14x | 36.8 GB/s |
Key Characteristics:
- Tanh: 200x+ speedup for f32 - fast approximation with saturation vs math.Tanh
- ReLU: Highest throughput (211-240 GB/s) - simple max(0, x) operation
- Sigmoid: 43x speedup for f32 - fast approximation with exponential
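A short usage sketch for the activation functions benchmarked above, assuming the f32 package exposes the same `(dst, src)` activation signatures listed in the f64 API table:

```go
package main

import (
	"fmt"

	"github.com/tphakala/simd/f32"
)

func main() {
	x := []float32{-2, -1, 0, 1, 2}

	relu := make([]float32, len(x))
	f32.ReLU(relu, x) // max(0, x) per element
	fmt.Println(relu) // [0 0 0 1 2]

	// Sigmoid and Tanh use fast vectorized approximations, so expect
	// small deviations from 1/(1+e^-x) and math.Tanh at full precision.
	sig := make([]float32, len(x))
	f32.Sigmoid(sig, x)

	th := make([]float32, len(x))
	f32.Tanh(th, x)
	fmt.Println(sig, th)
}
```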
Batch, signal, and audio operations:
| Operation | Config | SIMD | Go | Speedup |
|---|---|---|---|---|
| DotProductBatch (f64) | 256 vec × 100 rows | 3.2 µs | 20.5 µs | 6.4x |
| DotProductBatch (f32) | 256 vec × 100 rows | 1.5 µs | 9.8 µs | 6.7x |
| ConvolveValid (f64) | 4096 sig × 64 ker | 26.6 µs | 169 µs | 6.3x |
| ConvolveValid (f32) | 4096 sig × 64 ker | 17.9 µs | 80 µs | 4.5x |
| ConvolveValidMulti (f64) | 1000 sig × 64 ker × 2 | 13.4 µs | - | - |
| CubicInterpDot (f64) | 241 taps | 47 ns | 88 ns | 1.9x |
| CubicInterpDot (f32) | 241 taps | 21 ns | 66 ns | 3.1x |
| Int32ToFloat32Scale | 1024 elements | 40 ns | 364 ns | 9.0x |
| Int32ToFloat32Scale | 4096 elements | 153 ns | 1439 ns | 9.4x |
| Interleave2 (f64) | 1000 pairs | 216 ns | - | - |
| Deinterleave2 (f64) | 1000 pairs | 216 ns | - | - |
| Interleave2 (f32) | 1000 pairs | 109 ns | - | - |
| Deinterleave2 (f32) | 1000 pairs | 216 ns | - | - |
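The signal and audio operations compose into a typical block-processing chain; a sketch, assuming the usual "valid" convolution semantics (output length len(sig)-len(kernel)+1) alongside the overlap-add and interleave layouts stated in the API table:

```go
package main

import (
	"fmt"

	"github.com/tphakala/simd/f64"
)

func main() {
	sig := make([]float64, 256)
	for i := range sig {
		sig[i] = float64(i % 16)
	}
	kernel := []float64{0.25, 0.5, 0.25} // simple 3-tap smoothing FIR

	// "Valid" convolution: assumed output length len(sig)-len(kernel)+1.
	out := make([]float64, len(sig)-len(kernel)+1)
	f64.ConvolveValid(out, sig, kernel)

	// Overlap-add the processed block into an output buffer: mix[0:] += out.
	mix := make([]float64, len(sig))
	f64.AccumulateAdd(mix, out, 0)

	// Pack two mono channels into an interleaved stereo buffer.
	left, right := mix[:128], mix[128:256]
	stereo := make([]float64, len(left)+len(right))
	f64.Interleave2(stereo, left, right)
	fmt.Println(stereo[:4]) // [L0 R0 L1 R1]
}
```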
| Package | Average Speedup | Best | Operations |
|---|---|---|---|
| f32 | 6.5x | 21.8x (Abs) | 32 functions |
| f64 | 3.2x | 7.9x (Clamp) | 32 functions |
| c128 | 1.77x | 2.2x (Mul) | 8 functions |
ARM64 (NEON), float64:
| Operation | Size | Time | Throughput |
|---|---|---|---|
| DotProduct | 277 | 151 ns | 29 GB/s |
| DotProduct | 1000 | 513 ns | 31 GB/s |
| Add | 1000 | 775 ns | 31 GB/s |
| Mul | 1000 | 727 ns | 33 GB/s |
| FMA | 1000 | 890 ns | 36 GB/s |
| Sum | 1000 | 635 ns | 13 GB/s |
| Mean | 1000 | 677 ns | 12 GB/s |
ARM64 (NEON), float32:
| Operation | Size | Time | Throughput |
|---|---|---|---|
| DotProduct | 100 | 37 ns | 21 GB/s |
| DotProduct | 1000 | 263 ns | 30 GB/s |
| DotProduct | 10000 | 2.78 µs | 29 GB/s |
| Add | 1000 | 389 ns | 31 GB/s |
| Mul | 1000 | 390 ns | 31 GB/s |
| FMA | 1000 | 479 ns | 33 GB/s |
ARM64, SIMD vs pure Go:
| Operation | Size | SIMD | Pure Go | Speedup |
|---|---|---|---|---|
| DotProduct (f32) | 100 | 37 ns | 137 ns | 3.7x |
| DotProduct (f32) | 1000 | 262 ns | 1350 ns | 5.2x |
| DotProduct (f64) | 100 | 62 ns | 138 ns | 2.2x |
| DotProduct (f64) | 1000 | 513 ns | 1353 ns | 2.6x |
| Add (f32) | 1000 | 389 ns | 2015 ns | 5.2x |
| Sum (f32) | 1000 | 343 ns | 1327 ns | 3.9x |
- AMD64: Explicit SIMD provides 5x speedups for most operations compared to pure Go, with consistent high throughput across all vector sizes.
- ARM64: NEON SIMD provides substantial speedups over pure Go across all operations:
  - float32: 3.7x - 5.2x faster (4 elements per 128-bit vector)
  - float64: 2.2x - 2.6x faster (2 elements per 128-bit vector)
- CumulativeSum is inherently sequential (each element depends on the previous) and uses pure Go on all platforms.
On AMD64, the Min and Max functions fall back to pure Go for small slices:
- float64: slices with fewer than 4 elements
- float32: slices with fewer than 8 elements
This is because AVX assembly loads multiple elements at once (4 float64s or 8 float32s), which would cause out-of-bounds memory access on smaller slices.
The Go fallback for small slices is intentional and likely optimal - SIMD setup overhead (register loading, masking, horizontal reduction) would exceed the cost of a simple 2-3 element comparison loop.
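A hypothetical illustration of that guard (not the library's actual internals; `minAVX` is a stand-in name for the assembly kernel):

```go
// minFloat64 sketches the small-slice guard described above.
func minFloat64(a []float64) float64 {
	if len(a) < 4 {
		// Fewer elements than one AVX load of 4 float64s:
		// a plain Go loop is cheaper than SIMD setup.
		m := a[0] // assumes a non-empty slice
		for _, v := range a[1:] {
			if v < m {
				m = v
			}
		}
		return m
	}
	return minAVX(a) // AVX path: loads 4 elements per iteration
}

// minAVX stands in for the real assembly kernel; scalar here so the
// sketch compiles.
func minAVX(a []float64) float64 {
	m := a[0]
	for _, v := range a[1:] {
		if v < m {
			m = v
		}
	}
	return m
}
```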
| Architecture | Instruction Set | Status |
|---|---|---|
| AMD64 | AVX-512 | Full SIMD support |
| AMD64 | AVX + FMA | Full SIMD support |
| AMD64 | SSE2 | Full SIMD support |
| ARM64 | NEON/ASIMD | Full SIMD support |
| Other | - | Pure Go fallback |
- Pure Go assembly - Native Go assembler for maximum portability and easy cross-compilation
- Runtime dispatch - CPU features detected once at init time, zero runtime overhead
- Zero allocations - No heap allocations in hot paths
- Safe defaults - Gracefully falls back to pure Go on unsupported CPUs
- Boundary safe - Handles any slice length, not just SIMD-aligned sizes
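The first two principles can be pictured as a function pointer chosen once at package init; a hypothetical sketch using the documented cpu predicates (`addAVX` and `addGo` are stand-in names, not the library's internals):

```go
package vec

import "github.com/tphakala/simd/cpu"

// addImpl is selected once at init, so every Add call afterwards is a
// plain indirect call with no per-call feature check.
var addImpl func(dst, a, b []float64)

func init() {
	if cpu.HasAVX() {
		addImpl = addAVX
	} else {
		addImpl = addGo // safe default on CPUs without AVX
	}
}

// Add dispatches to whichever implementation init selected.
func Add(dst, a, b []float64) { addImpl(dst, a, b) }

// addAVX stands in for the assembly kernel in this sketch.
func addAVX(dst, a, b []float64) { addGo(dst, a, b) }

// addGo is the pure Go fallback.
func addGo(dst, a, b []float64) {
	for i := range dst {
		dst[i] = a[i] + b[i]
	}
}
```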
The library includes comprehensive tests with pure Go reference implementations for validation:
```sh
# Run all tests
go test ./...

# Run tests with verbose output
task test

# Run benchmarks
task bench

# Compare SIMD vs pure Go performance
task bench:compare

# Show CPU SIMD capabilities
task cpu
```

See Taskfile.yml for all available tasks.
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
This project is licensed under the MIT License - see the LICENSE file for details.