Release Scaling Vector Search: Numba, Cppyy, PeachPy, and SimSIMD · unum-cloud/usearch

USearch has now been used in two internet-scale companies/projects 🎉

In one case, it's only 1 Billion vectors. The second is closer to 10 Billion per node. So it feels like a good time to enter the Big ANN competition track 🐎 Still, we aim for an even larger scale and introduce a few improvements, such as the new CompiledMetric interface that allows one to mix C/C++, Python, and Assembly code for different hardware architectures in the same USearch configuration file 🤯 It should come in handy, given the rise in the number of exotic AI hardware vendors 😉

SIMD and JIT

The common idea across Unum projects is to use SIMD for hardware acceleration when we hit the wall.
Writing SIMD can be challenging, but distributing it may be even more difficult. You are compiling for modern hardware and modern libraries, and all of a sudden, someone deploys it on an old machine...

illegal instruction

A good balance can be achieved by compiling the core of USearch for higher compatibility and leaving plugs for hardware-specific optimizations. That is where Just-in-Time Compilation comes in. USearch allows combining the pre-compiled HNSW structure with JIT-ed distance metrics, which would tune the resulting assembly for the current target architecture. Such functionality was already supported with Numba JIT, but now you have more flexibility to choose different backends and define functions with different signatures:

MetricSignature.ArrayArray, aka float distance(scalar *, scalar *).
MetricSignature.ArrayArraySize, aka float distance(scalar *, scalar *, size_t).
MetricSignature.ArraySizeArraySize, aka float distance(scalar *, size_t, scalar *, size_t).

This enum is a core part of CompiledMetric and is passed down to our CPython binding.

class CompiledMetric(NamedTuple):
    pointer: int
    kind: MetricKind
    signature: MetricSignature

Numba JIT

Just to remind you, a minimal Numba + USearch example would look like this:

from numba import cfunc, types, carray

ndim = 256
signature = types.float32(
    types.CPointer(types.float32),
    types.CPointer(types.float32))

@cfunc(signature)
def inner_product(a, b):
    a_array = carray(a, ndim)
    b_array = carray(b, ndim)
    c = 0.0
    for i in range(ndim):
        c += a_array[i] * b_array[i]
    return 1 - c

index = Index(ndim=ndim, metric=CompiledMetric(
    pointer=inner_product.address,
    kind=MetricKind.IP,
    signature=MetricSignature.ArrayArray,
))

No scary C++ or Assembly involved.

Cppyy

Similarly, you can use Cppyy with Cling to JIT-compile native C or C++ code and pass it to USearch, which may be a good idea if you want to explicitly request loop-unrolling or other low-level optimizations!

import cppyy
import cppyy.ll

cppyy.cppdef("""
float inner_product(float *a, float *b) {
    float result = 0;
#pragma unroll
    for (size_t i = 0; i != ndim; ++i)
        result += a[i] * b[i];
    return 1 - result;
}
""".replace("ndim", str(ndim)))

function = cppyy.gbl.inner_product
index = Index(ndim=ndim, metric=CompiledMetric(
    pointer=cppyy.ll.addressof(function),
    kind=MetricKind.IP,
    signature=MetricSignature.ArrayArraySize,
))

PeachPy

Instead of writing in C or C++, you can go one level down and directly assemble a kernel for x86. Below is an example of constructing the "Inner Product" distance for 8-dimensional f32 vectors for x86 using PeachPy.

from peachpy import (
    Argument,
    ptr,
    float_,
    const_float_,
)
from peachpy.x86_64 import (
    abi,
    Function,
    uarch,
    isa,
    GeneralPurposeRegister64,
    LOAD,
    YMMRegister,
    VSUBPS,
    VADDPS,
    VHADDPS,
    VMOVUPS,
    VFMADD231PS,
    VPERM2F128,
    VXORPS,
    RETURN,
)

a = Argument(ptr(const_float_), name="a")
b = Argument(ptr(const_float_), name="b")

with Function(
    "inner_product", (a, b), float_, target=uarch.default + isa.avx2
) as asm_function:
    
    # Request two 64-bit general-purpose registers for addresses
    reg_a, reg_b = GeneralPurposeRegister64(), GeneralPurposeRegister64()
    LOAD.ARGUMENT(reg_a, a)
    LOAD.ARGUMENT(reg_b, b)

    # Load the vectors
    ymm_a = YMMRegister()
    ymm_b = YMMRegister()
    VMOVUPS(ymm_a, [reg_a])
    VMOVUPS(ymm_b, [reg_b])

    # Prepare the accumulator
    ymm_c = YMMRegister()
    ymm_one = YMMRegister()
    VXORPS(ymm_c, ymm_c, ymm_c)
    VXORPS(ymm_one, ymm_one, ymm_one)

    # Accumulate A and B products into C
    VFMADD231PS(ymm_c, ymm_a, ymm_b)

    # Reduce the contents of a YMM register
    ymm_c_permuted = YMMRegister()
    VPERM2F128(ymm_c_permuted, ymm_c, ymm_c, 1)
    VADDPS(ymm_c, ymm_c, ymm_c_permuted)
    VHADDPS(ymm_c, ymm_c, ymm_c)
    VHADDPS(ymm_c, ymm_c, ymm_c)

    # Negate the values from "similarity" to "distance"
    VSUBPS(ymm_c, ymm_one, ymm_c)

    # A common convention is to return floats in XMM registers
    RETURN(ymm_c.as_xmm)

python_function = asm_function.finalize(abi.detect()).encode().load()
metric = CompiledMetric(
    pointer=python_function.loader.code_address,
    kind=MetricKind.IP,
    signature=MetricSignature.ArrayArray,
)
index = Index(ndim=ndim, metric=metric)

SimSIMD

Last but not the list, for the most common metrics and vector spaces, I am going to provide pre-compiled optimized CPython Capsules as part of SimSIMD. Let us know which embeddings you use the most - OpenAI, BERT, ViT, or maybe UForm?

Release Notes: 0.19.0 (2023-06-24)

Add

NDCG evaluation (25cb6af)
PeachPy example (1bf0bbb)
Restore index from disk with only a path (b7aec4c)
Windows file-mapping API (794f173)

Fix

Re-indexing vectors on restart (529c137)

Improve

Preserve metric_kind_t in snapshots (5ec0b65)
Support Cppyy JIT compilation (80f99cd)
Support signatures for JIT-ed metrics (c89e145)

Hashes

Source code (zip): 19c2e9f41134c17ac5db20567b268ee1fce1b4d5dbb42a8b6e5a252ac9e242de
Source code (tar.gz): 34e5bd00639504d752f0ab5eb96e62bc34ab18537e6e31823f3e487a4d8139a3

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Scaling Vector Search: Numba, Cppyy, PeachPy, and SimSIMD