ENH: Up-to 200x Faster SIMD-Accelerated Distance Functions #19454

Open · ashvardanian opened this issue Oct 31, 2023 · 5 comments

Labels: enhancement, scipy.spatial

ashvardanian commented Oct 31, 2023

Introduction:
SciPy's spatial distance computations are fundamental to many scientific and data-science tasks. To accelerate them, the underlying math operations must be optimized for modern hardware.

Background:
Currently, SciPy relies on NumPy for math operations, which in turn depends on the underlying BLAS implementation. These BLAS libraries, however, may not be fully optimized for the latest hardware, potentially limiting performance.

The SimSIMD Solution:
I've developed a low-level library called SimSIMD, which provides accelerated implementations of commonly used distance functions. Notably, SimSIMD is already in use by projects like USearch and ClickHouse, and is an optional backend in LangChain. The library boasts specialized backends for:

  • x86: AVX2 and AVX-512 F/FP16/VNNI
  • Arm: NEON and SVE

These cover most CPUs produced in the past decade, offering potential speed-ups.
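
For illustration, here is a minimal sketch of how SimSIMD's Python bindings can stand in for the corresponding SciPy calls. It assumes the `simsimd` package from PyPI and its `cosine`/`sqeuclidean` entry points; the exact names and return types may differ between releases, so treat it as a sketch rather than the definitive API.

```python
import numpy as np
from scipy.spatial import distance
import simsimd  # assumed: installed via `pip install simsimd`

a = np.random.rand(1536).astype(np.float32)
b = np.random.rand(1536).astype(np.float32)

# Reference values from SciPy's existing implementations.
scipy_cos = distance.cosine(a, b)
scipy_sqe = distance.sqeuclidean(a, b)

# SimSIMD equivalents; the names mirror SciPy's, but check the
# SimSIMD README for the exact API of the installed version.
simd_cos = simsimd.cosine(a, b)
simd_sqe = simsimd.sqeuclidean(a, b)

print(scipy_cos, float(simd_cos))
print(scipy_sqe, float(simd_sqe))
```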

Evidence:

While SimSIMD is not a complete replacement for SciPy's API, it covers the most frequently used distance functions; further functions can be incorporated based on community feedback.

Proposal:
Given the potential benefits, I propose considering SimSIMD as an optional backend in SciPy, as suggested on the SciPy Slack. This would give users an optimized path for spatial distance computations on modern hardware.
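
One possible shape of the "optional backend" idea, as a rough sketch only (none of these names exist in SciPy today): try to import SimSIMD once at module load and fall back to the existing implementation when the package is unavailable or the dtype is not supported.

```python
# Hypothetical dispatch layer; illustrative only, not SciPy's actual code.
import numpy as np

try:
    import simsimd as _simsimd  # optional dependency
except ImportError:
    _simsimd = None

def cosine_distance(u, v):
    """Cosine distance that prefers the SIMD backend when it is usable."""
    u = np.asarray(u)
    v = np.asarray(v)
    if _simsimd is not None and u.dtype == v.dtype and u.dtype in ("float32", "float64"):
        return float(_simsimd.cosine(u, v))  # assumed SimSIMD entry point
    # Fall back to the existing SciPy implementation.
    from scipy.spatial.distance import cosine
    return cosine(u, v)
```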

ashvardanian added the enhancement label on Oct 31, 2023
dschmitz89 (Contributor) commented Oct 31, 2023

Sounds exciting! Which SciPy version did you use in your benchmarks? Many of SciPy's distance metrics were reimplemented in C++ recently (version 1.11.0).

rgommers (Member) commented:

Thanks for this proposal @ashvardanian, this is quite interesting. An optional dependency indeed seems like a good idea to me, since there's a lot of interest in making these distance functions faster (also Cc @Micky774 who worked on accelerating distance functions in scikit-learn).

My main question right now is about dtype support - from the README it seems that float64 is not supported? That's by far the most heavily used dtype, so I'm wondering how you are looking at that.

ashvardanian (Author) commented:

@rgommers, valid point! I wasn't originally expecting any gains for double-precision functions, but it turns out they are quite significant. @dschmitz89, I compared against the most recent version available at the time, which is now 1.11.4. I'm attaching the results from my latest commit, benchmarked on a Sapphire Rapids CPU (a minimal timing sketch follows the tables).

Benchmarking SimSIMD vs. SciPy

  • Vector dimensions: 1536
  • Vectors count: 1000
  • Hardware capabilities: serial, x86_avx2, x86_avx512, x86_avx2fp16, x86_avx512fp16, x86_avx512vpopcntdq, x86_avx512vnni
  • NumPy BLAS dependency: openblas64
  • NumPy LAPACK dependency: dep140640983012528

Between 2 Vectors, Batch Size: 1

| Datatype | Method | Ops/s | SimSIMD Ops/s | SimSIMD Improvement |
|----------|--------|-------|---------------|---------------------|
| f64 | scipy.cosine | 63,612 | 572,605 | 9.00 x |
| f64 | scipy.sqeuclidean | 238,547 | 915,596 | 3.84 x |
| f64 | numpy.inner | 449,499 | 986,522 | 2.19 x |

Between 2 Vectors, Batch Size: 1,000

| Datatype | Method | Ops/s | SimSIMD Ops/s | SimSIMD Improvement |
|----------|--------|-------|---------------|---------------------|
| f64 | scipy.cosine | 68,962 | 1,457,172 | 21.13 x |
| f64 | scipy.sqeuclidean | 247,727 | 1,535,547 | 6.20 x |
| f64 | numpy.inner | 463,509 | 1,512,004 | 3.26 x |
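
For reference, a minimal timing loop in the spirit of the numbers above could look like the following. It assumes the `simsimd` package and 1536-dimensional float64 inputs; actual throughput will depend on the CPU, the BLAS in use, and the library versions.

```python
import timeit
import numpy as np
from scipy.spatial import distance
import simsimd  # assumed optional dependency

a = np.random.rand(1536)  # float64, matching the tables above
b = np.random.rand(1536)

n = 10_000
scipy_time = timeit.timeit(lambda: distance.cosine(a, b), number=n)
simd_time = timeit.timeit(lambda: simsimd.cosine(a, b), number=n)

print(f"scipy.cosine:   {n / scipy_time:,.0f} ops/s")
print(f"simsimd.cosine: {n / simd_time:,.0f} ops/s")
```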

ashvardanian (Author) commented:

Porting to SVE-powered Graviton 3 chips also yields good results.

Benchmarking SimSIMD vs. SciPy

  • Vector dimensions: 1536
  • Vectors count: 1000
  • Hardware capabilities: serial, arm_neon, arm_sve
  • NumPy BLAS dependency: openblas64
  • NumPy LAPACK dependency: openblas64

Between 2 Vectors, Batch Size: 1

| Datatype | Method | Ops/s | SimSIMD Ops/s | SimSIMD Improvement |
|----------|--------|-------|---------------|---------------------|
| f64 | scipy.cosine | 40,729 | 725,382 | 17.81 x |
| f64 | scipy.sqeuclidean | 160,812 | 728,114 | 4.53 x |
| f64 | numpy.inner | 473,443 | 767,374 | 1.62 x |
| f64 | scipy.jensenshannon | 15,684 | 38,528 | 2.46 x |
| f64 | scipy.kl_div | 49,983 | 61,811 | 1.24 x |

Between 2 Vectors, Batch Size: 1,000

| Datatype | Method | Ops/s | SimSIMD Ops/s | SimSIMD Improvement |
|----------|--------|-------|---------------|---------------------|
| f64 | scipy.cosine | 41,130 | 1,460,850 | 35.52 x |
| f64 | scipy.sqeuclidean | 162,147 | 1,486,255 | 9.17 x |
| f64 | numpy.inner | 473,856 | 1,580,136 | 3.33 x |

ashvardanian (Author) commented Mar 15, 2024

@rgommers and @dschmitz89, hi 👋

Small update: with SimSIMD v4 and newer, return values are now also 64-bit floats, matching SciPy, and more input types are supported in dot products, including complex128, complex64, and the previously missing complex32.
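
As a hedged sketch of what the complex dot-product support might look like from Python (the exact function names and the handling of conjugation should be checked against the SimSIMD README for the installed version):

```python
import numpy as np
import simsimd  # assumed: SimSIMD v4 or newer

a = (np.random.rand(1536) + 1j * np.random.rand(1536)).astype(np.complex128)
b = (np.random.rand(1536) + 1j * np.random.rand(1536)).astype(np.complex128)

# Unconjugated complex dot product; NumPy serves as a sanity check.
simd_dot = simsimd.dot(a, b)   # assumed entry point, see the SimSIMD README
numpy_dot = np.dot(a, b)
print(complex(simd_dot), numpy_dot)
```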

I've also changed the dynamic dispatch strategy. Aside from the serial and Arm backends, on x86 I now differentiate between Haswell, Skylake, Ice Lake, and Sapphire Rapids. Older CPUs got a noticeable speed bump in some workloads, especially the bit-level Hamming and Jaccard distances.

I am also now covering a broader build matrix: 105 builds spanning all Python versions supported by PyPI. That's more than NumPy (35) and SciPy (24) combined, so it should be easy to integrate if more performance is needed 🤗
