
# simd


A high-performance SIMD (Single Instruction, Multiple Data) library for Go providing vectorized operations on float64, float32, and complex128 slices.

## Features

- **Pure Go assembly** - Native Go assembler, simple cross-compilation
- **Runtime CPU detection** - Automatically selects the optimal implementation (AVX-512, AVX+FMA, SSE2, NEON, or pure Go)
- **Zero allocations** - All operations work on pre-allocated slices
- **46 operations** - Arithmetic, reduction, statistical, vector, signal processing, activation functions, and complex number operations
- **Multi-architecture** - AMD64 (AVX-512/AVX+FMA/SSE2) and ARM64 (NEON) with a pure Go fallback
- **Thread-safe** - All functions are safe for concurrent use

## Installation

```bash
go get github.com/tphakala/simd
```

Requires Go 1.25+

## Quick Start

```go
package main

import (
    "fmt"
    "github.com/tphakala/simd/cpu"
    "github.com/tphakala/simd/f64"
)

func main() {
    fmt.Println("SIMD:", cpu.Info())

    // Vector operations
    a := []float64{1, 2, 3, 4, 5, 6, 7, 8}
    b := []float64{8, 7, 6, 5, 4, 3, 2, 1}

    // Dot product
    dot := f64.DotProduct(a, b)
    fmt.Println("Dot product:", dot) // 120

    // Element-wise operations
    dst := make([]float64, len(a))
    f64.Add(dst, a, b)
    fmt.Println("Sum:", dst) // [9, 9, 9, 9, 9, 9, 9, 9]

    // Statistical operations
    mean := f64.Mean(a)
    stddev := f64.StdDev(a)
    fmt.Printf("Mean: %.2f, StdDev: %.2f\n", mean, stddev)

    // Vector operations
    f64.Normalize(dst, a)
    fmt.Println("Normalized:", dst)

    // Distance calculation
    dist := f64.EuclideanDistance(a, b)
    fmt.Println("Distance:", dist)
}
```

## Packages

### cpu - CPU Feature Detection

import "github.com/tphakala/simd/cpu"

fmt.Println(cpu.Info())      // "AMD64 AVX-512", "AMD64 AVX+FMA", "AMD64 SSE2", or "ARM64 NEON"
fmt.Println(cpu.HasAVX())    // true/false
fmt.Println(cpu.HasAVX512()) // true/false
fmt.Println(cpu.HasNEON())   // true/false
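
Dispatch happens automatically inside each operation, so these checks are mainly useful for logging and diagnostics, or for tuning at a higher level. A minimal sketch using only the calls documented above (the batch-size heuristic is purely illustrative):

```go
package main

import (
    "fmt"

    "github.com/tphakala/simd/cpu"
)

func main() {
    // Report which kernel family the library selected on this machine.
    fmt.Println("SIMD backend:", cpu.Info())

    // Illustrative heuristic: wider registers may favor larger batches.
    batch := 4
    if cpu.HasAVX512() {
        batch = 8
    }
    fmt.Println("batch size:", batch)
}
```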

### f64 - float64 Operations

| Category | Function | Description | SIMD Width |
|----------|----------|-------------|------------|
| Arithmetic | Add(dst, a, b) | Element-wise addition | 8x (AVX-512) / 4x (AVX) / 2x (NEON) |
| | Sub(dst, a, b) | Element-wise subtraction | 8x / 4x / 2x |
| | Mul(dst, a, b) | Element-wise multiplication | 8x / 4x / 2x |
| | Div(dst, a, b) | Element-wise division | 8x / 4x / 2x |
| | Scale(dst, a, s) | Multiply by scalar | 8x / 4x / 2x |
| | AddScalar(dst, a, s) | Add scalar | 8x / 4x / 2x |
| | FMA(dst, a, b, c) | Fused multiply-add: a*b+c | 8x / 4x / 2x |
| | AddScaled(dst, alpha, s) | dst += alpha*s (axpy) | 8x / 4x / 2x |
| Unary | Abs(dst, a) | Absolute value | 8x / 4x / 2x |
| | Neg(dst, a) | Negation | 8x / 4x / 2x |
| | Sqrt(dst, a) | Square root | 8x / 4x / 2x |
| | Reciprocal(dst, a) | Reciprocal (1/x) | 8x / 4x / 2x |
| Reduction | DotProduct(a, b) | Dot product | 8x / 4x / 2x |
| | Sum(a) | Sum of elements | 8x / 4x / 2x |
| | Min(a) | Minimum value | 8x / 4x / 2x |
| | Max(a) | Maximum value | 8x / 4x / 2x |
| | MinIdx(a) | Index of minimum value | Pure Go |
| | MaxIdx(a) | Index of maximum value | Pure Go |
| Statistical | Mean(a) | Arithmetic mean | 8x / 4x / 2x |
| | Variance(a) | Population variance | 8x / 4x / 2x |
| | StdDev(a) | Standard deviation | 8x / 4x / 2x |
| Vector | EuclideanDistance(a, b) | L2 distance | 8x / 4x / 2x |
| | Normalize(dst, a) | Unit vector normalization | 8x / 4x / 2x |
| | CumulativeSum(dst, a) | Running sum | Sequential |
| Range | Clamp(dst, a, min, max) | Clamp to range | 8x / 4x / 2x |
| Activation | Sigmoid(dst, src) | Sigmoid: 1/(1+e^-x) | 4x (AVX) / 2x (NEON) |
| | ReLU(dst, src) | Rectified Linear Unit | 8x / 4x / 2x |
| | Tanh(dst, src) | Hyperbolic tangent | 8x / 4x / 2x |
| | Exp(dst, src) | Exponential e^x | Pure Go |
| | ClampScale(dst, src, min, max, s) | Fused clamp and scale | 8x / 4x / 2x |
| Batch | DotProductBatch(r, rows, v) | Multiple dot products | 8x / 4x / 2x |
| Signal | ConvolveValid(dst, sig, k) | FIR filter / convolution | 8x / 4x / 2x |
| | ConvolveValidMulti(dsts, sig, ks) | Multi-kernel convolution | 8x / 4x / 2x |
| | AccumulateAdd(dst, src, off) | Overlap-add: dst[off:] += src | 8x / 4x / 2x |
| Audio | Interleave2(dst, a, b) | Pack stereo: [L,R,L,R,...] | 4x / 2x |
| | Deinterleave2(a, b, src) | Unpack stereo to channels | 4x / 2x |
| | CubicInterpDot(hist,a,b,c,d,x) | Fused cubic interp dot product | 4x / 2x |
| | Int32ToFloat32Scale(dst,src,s) | PCM int32 to normalized float | 8x / 4x |
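
A short sketch exercising a few of the entries above, assuming the signatures listed in the table (expected outputs computed by hand):

```go
package main

import (
    "fmt"

    "github.com/tphakala/simd/f64"
)

func main() {
    a := []float64{1, 2, 3, 4}
    b := []float64{5, 6, 7, 8}
    c := []float64{1, 1, 1, 1}
    dst := make([]float64, len(a))

    // Fused multiply-add: dst[i] = a[i]*b[i] + c[i]
    f64.FMA(dst, a, b, c)
    fmt.Println(dst) // [6 13 22 33]

    // axpy-style update: dst[i] += 0.5 * a[i]
    f64.AddScaled(dst, 0.5, a)
    fmt.Println(dst) // [6.5 14 23.5 35]

    // Clamp every element into [10, 30]
    out := make([]float64, len(dst))
    f64.Clamp(out, dst, 10, 30)
    fmt.Println(out) // [10 14 23.5 30]
}
```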

### f32 - float32 Operations

Same API as f64 but for float32 with wider SIMD:

| Architecture | SIMD Width |
|--------------|------------|
| AMD64 (AVX-512) | 16x float32 |
| AMD64 (AVX+FMA) | 8x float32 |
| AMD64 (SSE2) | 4x float32 |
| ARM64 (NEON) | 4x float32 |
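
Because the API mirrors f64, porting code is a matter of swapping the element type. A minimal sketch:

```go
package main

import (
    "fmt"

    "github.com/tphakala/simd/f32"
)

func main() {
    a := []float32{1, 2, 3, 4}
    b := []float32{4, 3, 2, 1}

    fmt.Println(f32.DotProduct(a, b)) // 20

    dst := make([]float32, len(a))
    f32.Add(dst, a, b)
    fmt.Println(dst) // [5 5 5 5]
}
```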

### c128 - complex128 Operations

SIMD-accelerated complex number operations for FFT-based signal processing:

| Category | Function | Description | SIMD Width |
|----------|----------|-------------|------------|
| Arithmetic | Mul(dst, a, b) | Complex multiplication | 4x (AVX-512) / 2x (AVX) |
| | MulConj(dst, a, b) | Multiply by conjugate: a × conj(b) | 4x / 2x |
| | Scale(dst, a, s) | Scale by complex scalar | 4x / 2x |
| | Add(dst, a, b) | Complex addition | 4x / 2x |
| | Sub(dst, a, b) | Complex subtraction | 4x / 2x |
| Unary | Abs(dst, a) | Complex magnitude \|a + bi\| | 4x (AVX-512) / 2x (AVX) |
| | AbsSq(dst, a) | Magnitude squared \|a + bi\|² | 4x / 2x |
| | Conj(dst, a) | Complex conjugate: a - bi | 4x / 2x |

These operations are designed for FFT-based signal processing pipelines:

import "github.com/tphakala/simd/c128"

// Frequency-domain multiplication (FFT convolution)
signalFFT := make([]complex128, n)
kernelFFT := make([]complex128, n)
result := make([]complex128, n)
magnitude := make([]float64, n)

// Frequency-domain filtering
c128.Mul(result, signalFFT, kernelFFT)          // Complex multiply
c128.MulConj(result, signalFFT, kernelFFT)      // Cross-correlation

// Spectrogram and magnitude analysis
c128.Abs(magnitude, signalFFT)                  // Extract magnitude for display

Use Cases:

- **Abs/AbsSq**: Spectrograms, power spectral density, frequency analysis
- **Conj**: Cross-correlation, frequency-domain filtering
- **Mul/MulConj**: FFT-based convolution, filtering, correlation

Benchmark (1024 elements, Intel i7-1260P AVX+FMA):

| Operation | SIMD | Pure Go | Speedup |
|-----------|------|---------|---------|
| Mul | 341 ns | 757 ns | 2.2x |
| MulConj | 340 ns | 749 ns | 2.2x |
| Scale | 253 ns | 551 ns | 2.2x |
| Add | 86 ns | 189 ns | 2.2x |
| Abs | 1326 ns | 2260 ns | 1.7x |
| AbsSq | 367 ns | 504 ns | 1.37x |
| Conj | 304 ns | 474 ns | 1.56x |

## Performance

### AMD64 (Intel Core i7-1260P, AVX+FMA)

#### float64 Operations - SIMD vs Pure Go (1024 elements)

| Category | Operation | SIMD (ns) | Go (ns) | Speedup |
|----------|-----------|-----------|---------|---------|
| Arithmetic | Add | 84 | 446 | 5.3x |
| | Sub | 84 | 335 | 4.0x |
| | Mul | 86 | 436 | 5.1x |
| | Div | 441 | 941 | 2.1x |
| | Scale | 68 | 272 | 4.0x |
| | AddScalar | 68 | 286 | 4.2x |
| | FMA | 110 | 557 | 5.0x |
| Unary | Abs | 66 | 365 | 5.6x |
| | Neg | 66 | 306 | 4.6x |
| | Sqrt | 658 | 1323 | 2.0x |
| | Reciprocal | 447 | 920 | 2.1x |
| Reduction | DotProduct | 162 | 859 | 5.3x |
| | Sum | 82 | 184 | 2.3x |
| | Min | 157 | 340 | 2.2x |
| | Max | 154 | 352 | 2.3x |
| Statistical | Mean | 82 | 184 | 2.3x |
| | Variance* | 820 | 902 | 1.1x |
| | StdDev* | 825 | 905 | 1.1x |
| Vector | EuclideanDistance | 216 | 1071 | 5.0x |
| | Normalize | 220 | 1080 | 4.9x |
| | CumulativeSum | 428 | 425 | 1.0x |
| Range | Clamp | 81 | 640 | 7.9x |

*Variance/StdDev benchmarked at 4096 elements (the SIMD gain only appears at larger sizes)

#### float32 Operations - SIMD vs Pure Go (1024 elements)

| Category | Operation | SIMD (ns) | Go (ns) | Speedup |
|----------|-----------|-----------|---------|---------|
| Arithmetic | Add | 47 | 441 | 9.4x |
| | Sub | 49 | 339 | 6.9x |
| | Mul | 49 | 436 | 8.9x |
| | Div | 138 | 655 | 4.8x |
| | Scale | 40 | 299 | 7.4x |
| | AddScalar | 39 | 272 | 7.0x |
| | FMA | 64 | 444 | 6.9x |
| Unary | Abs | 37 | 656 | 17.6x |
| | Neg | 40 | 273 | 6.9x |
| Reduction | DotProduct | 71 | 424 | 5.9x |
| | Sum | 41 | 123 | 3.0x |
| | Min | 65 | 340 | 5.2x |
| | Max | 66 | 352 | 5.3x |
| Range | Clamp | 47 | 701 | 14.8x |

#### Activation Functions - SIMD vs Pure Go

float32 (1024 elements):

| Function | SIMD (ns) | Go (ns) | Speedup | SIMD Throughput |
|----------|-----------|---------|---------|-----------------|
| Sigmoid | 138 | 5906 | 43x | 59.3 GB/s |
| ReLU | 39 | 662 | 17x | 211 GB/s |
| Tanh | 138 | 28116 | 204x | 59.5 GB/s |

float64 (1024 elements):

| Function | SIMD (ns) | Go (ns) | Speedup | SIMD Throughput |
|----------|-----------|---------|---------|-----------------|
| ReLU | 68 | 646 | 9.5x | 240 GB/s |
| Tanh | 445 | 6230 | 14x | 36.8 GB/s |

Key Characteristics:

- **Tanh**: 200x+ speedup for f32 - a fast saturating approximation versus calling math.Tanh per element
- **ReLU**: highest throughput (211-240 GB/s) - a simple max(0, x) operation
- **Sigmoid**: 43x speedup for f32 - a fast approximation of the exponential
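
A small sketch of the activation API, assuming the same (dst, src) signatures as the tables above. Because Sigmoid and Tanh are fast approximations, the printed values are approximate:

```go
package main

import (
    "fmt"

    "github.com/tphakala/simd/f32"
)

func main() {
    src := []float32{-2, -0.5, 0, 0.5, 2}
    dst := make([]float32, len(src))

    f32.ReLU(dst, src) // max(0, x) per element
    fmt.Println(dst)   // [0 0 0 0.5 2]

    f32.Sigmoid(dst, src) // approximation; expect small deviations
    fmt.Println(dst)      // ~[0.12 0.38 0.50 0.62 0.88]

    f32.Tanh(dst, src) // approximation, saturating
    fmt.Println(dst)   // ~[-0.96 -0.46 0 0.46 0.96]
}
```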

Batch & Signal Processing (varied sizes)

| Operation | Config | SIMD | Go | Speedup |
|-----------|--------|------|----|---------|
| DotProductBatch (f64) | 256 vec × 100 rows | 3.2 µs | 20.5 µs | 6.4x |
| DotProductBatch (f32) | 256 vec × 100 rows | 1.5 µs | 9.8 µs | 6.7x |
| ConvolveValid (f64) | 4096 sig × 64 ker | 26.6 µs | 169 µs | 6.3x |
| ConvolveValid (f32) | 4096 sig × 64 ker | 17.9 µs | 80 µs | 4.5x |
| ConvolveValidMulti (f64) | 1000 sig × 64 ker × 2 | 13.4 µs | - | - |
| CubicInterpDot (f64) | 241 taps | 47 ns | 88 ns | 1.9x |
| CubicInterpDot (f32) | 241 taps | 21 ns | 66 ns | 3.1x |
| Int32ToFloat32Scale | 1024 elements | 40 ns | 364 ns | 9.0x |
| Int32ToFloat32Scale | 4096 elements | 153 ns | 1439 ns | 9.4x |
| Interleave2 (f64) | 1000 pairs | 216 ns | - | - |
| Deinterleave2 (f64) | 1000 pairs | 216 ns | - | - |
| Interleave2 (f32) | 1000 pairs | 109 ns | - | - |
| Deinterleave2 (f32) | 1000 pairs | 216 ns | - | - |
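
A minimal ConvolveValid sketch. It assumes the usual "valid" length convention, len(dst) = len(sig) - len(k) + 1 (verify against the package docs); the symmetric kernel makes the result independent of kernel orientation:

```go
package main

import (
    "fmt"

    "github.com/tphakala/simd/f64"
)

func main() {
    sig := []float64{1, 2, 3, 4, 5}
    k := []float64{0.25, 0.5, 0.25} // symmetric smoothing kernel

    // "Valid" mode: outputs only where the kernel fully overlaps the signal.
    dst := make([]float64, len(sig)-len(k)+1)
    f64.ConvolveValid(dst, sig, k)
    fmt.Println(dst) // [2 3 4]
}
```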

#### Performance Summary

| Package | Average Speedup | Best Speedup | Functions |
|---------|-----------------|--------------|-----------|
| f32 | 6.5x | 21.8x (Abs) | 32 |
| f64 | 3.2x | 7.9x (Clamp) | 32 |
| c128 | 1.77x | 2.2x (Mul) | 8 |

### ARM64 (Raspberry Pi 5, NEON)

#### float64 Operations

| Operation | Size | Time | Throughput |
|-----------|------|------|------------|
| DotProduct | 277 | 151 ns | 29 GB/s |
| DotProduct | 1000 | 513 ns | 31 GB/s |
| Add | 1000 | 775 ns | 31 GB/s |
| Mul | 1000 | 727 ns | 33 GB/s |
| FMA | 1000 | 890 ns | 36 GB/s |
| Sum | 1000 | 635 ns | 13 GB/s |
| Mean | 1000 | 677 ns | 12 GB/s |

#### float32 Operations

| Operation | Size | Time | Throughput |
|-----------|------|------|------------|
| DotProduct | 100 | 37 ns | 21 GB/s |
| DotProduct | 1000 | 263 ns | 30 GB/s |
| DotProduct | 10000 | 2.78 µs | 29 GB/s |
| Add | 1000 | 389 ns | 31 GB/s |
| Mul | 1000 | 390 ns | 31 GB/s |
| FMA | 1000 | 479 ns | 33 GB/s |

#### Comparison vs Pure Go

| Operation | Size | SIMD | Pure Go | Speedup |
|-----------|------|------|---------|---------|
| DotProduct (f32) | 100 | 37 ns | 137 ns | 3.7x |
| DotProduct (f32) | 1000 | 262 ns | 1350 ns | 5.2x |
| DotProduct (f64) | 100 | 62 ns | 138 ns | 2.2x |
| DotProduct (f64) | 1000 | 513 ns | 1353 ns | 2.6x |
| Add (f32) | 1000 | 389 ns | 2015 ns | 5.2x |
| Sum (f32) | 1000 | 343 ns | 1327 ns | 3.9x |

### Performance Notes

- **AMD64**: Explicit SIMD provides roughly 5x speedups for most operations over pure Go, with consistently high throughput across all vector sizes.
- **ARM64**: NEON provides substantial speedups over pure Go across all operations:
  - float32: 3.7x - 5.2x faster (4 elements per 128-bit vector)
  - float64: 2.2x - 2.6x faster (2 elements per 128-bit vector)
- **CumulativeSum** is inherently sequential (each element depends on the previous one) and uses pure Go on all platforms; the recurrence is sketched below.
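
The constraint is visible in the scalar recurrence, shown here as a plain-Go reference formulation (not the library's source):

```go
// cumulativeSum: dst[i] depends on dst[i-1], so the loop body cannot
// be mapped onto independent SIMD lanes.
func cumulativeSum(dst, a []float64) {
    if len(a) == 0 {
        return
    }
    dst[0] = a[0]
    for i := 1; i < len(a); i++ {
        dst[i] = dst[i-1] + a[i]
    }
}
```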

## Known Limitations

### Small Slice Fallback for Min/Max (AMD64)

On AMD64, the Min and Max functions fall back to pure Go for small slices:

- float64: slices with fewer than 4 elements
- float32: slices with fewer than 8 elements

This is because AVX assembly loads multiple elements at once (4 float64s or 8 float32s), which would cause out-of-bounds memory access on smaller slices.

The Go fallback for small slices is intentional and likely optimal - SIMD setup overhead (register loading, masking, horizontal reduction) would exceed the cost of a simple 2-3 element comparison loop.
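
A simplified sketch of the length-guard pattern described here (illustrative only, not the library's source; minAVX is a hypothetical stand-in for the assembly kernel):

```go
// minDispatch routes short slices to the scalar path, because a vector
// load of 4 float64s would read past the end of a 2- or 3-element slice.
func minDispatch(a []float64) float64 {
    if len(a) < 4 {
        return minScalar(a)
    }
    return minAVX(a)
}

func minScalar(a []float64) float64 {
    m := a[0]
    for _, v := range a[1:] {
        if v < m {
            m = v
        }
    }
    return m
}

// minAVX stands in for the AVX kernel; it reuses the scalar loop here
// only so that the sketch compiles.
func minAVX(a []float64) float64 { return minScalar(a) }
```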

## Architecture Support

| Architecture | Instruction Set | Status |
|--------------|-----------------|--------|
| AMD64 | AVX-512 | Full SIMD support |
| AMD64 | AVX + FMA | Full SIMD support |
| AMD64 | SSE2 | Full SIMD support |
| ARM64 | NEON/ASIMD | Full SIMD support |
| Other | - | Pure Go fallback |

## Design Principles

1. **Pure Go assembly** - Native Go assembler for maximum portability and easy cross-compilation
2. **Runtime dispatch** - CPU features are detected once at init time, so hot paths pay no per-call detection cost (sketched below)
3. **Zero allocations** - No heap allocations in hot paths
4. **Safe defaults** - Gracefully falls back to pure Go on unsupported CPUs
5. **Boundary safe** - Handles any slice length, not just SIMD-aligned sizes
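
Principle 2 is commonly implemented with a package-level function variable assigned once in init. A simplified sketch of that pattern (not the library's actual source):

```go
package main

import "fmt"

// addKernel is selected once at init; hot-path calls are plain
// indirect calls with no per-call feature detection.
var addKernel func(dst, a, b []float64)

func addGo(dst, a, b []float64) {
    for i := range dst {
        dst[i] = a[i] + b[i]
    }
}

func init() {
    // A real implementation would consult CPU feature flags here
    // (AVX-512, AVX+FMA, SSE2, NEON) and pick an assembly kernel.
    addKernel = addGo
}

func main() {
    dst := make([]float64, 4)
    addKernel(dst, []float64{1, 2, 3, 4}, []float64{4, 3, 2, 1})
    fmt.Println(dst) // [5 5 5 5]
}
```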

## Testing

The library includes comprehensive tests with pure Go reference implementations for validation:

```bash
# Run all tests
go test ./...

# Run tests with verbose output
task test

# Run benchmarks
task bench

# Compare SIMD vs pure Go performance
task bench:compare

# Show CPU SIMD capabilities
task cpu
```

See Taskfile.yml for all available tasks.

## Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

## License

This project is licensed under the MIT License - see the LICENSE file for details.