Skip to content

2.25.2.0-b354

@spolitov spolitov tagged this 19 Apr 05:05
Summary:
# YbHnsw: A Block-Optimized HNSW Implementation

This diff introduces the `YbHnsw` class, which implements HNSW (Hierarchical Navigable Small World) search with a block-based data organization, enabling efficient loading and unloading of data blocks based on available memory.

## Overview
- **Read-only access**: Generated from an existing USearch index.
- **Height-based vector ordering**: Vectors are sorted by their level height, meaning vectors with more levels have lower IDs. Consequently, a vector's index in a level matches its ID.

## Data Structure
The index consists of a **header** and multiple **blocks** of different types, allowing flexible memory management.
Blocks of the same type are stored contiguously, forming dedicated sections for each block type.

### Header Information
The header contains metadata about the index:
- `dimensions`: Vector dimensions in the index.
- `vector_data_size`: Size (in bytes) of a vector header (depends on dimensions).
- `entry`: ID of the entry vector for search.
- `max_level`: Maximum vector level in the index.
- `config`: Configuration of the source USearch index.
- `max_block_size`: Maximum block size (in bytes) used during index generation.
- `max_vectors_per_non_base_block`: Maximum number of vectors in a non-base-layer neighbors block.
- `vector_data_block`: ID of the first block containing vector headers.
- `vector_data_amount_per_block`: Number of vector records per vector headers block (depends on `max_block_size` and `vector_size`).
- `layers`: Information about each layer in the index.

### Layer Information
Each layer entry includes:
- `size`: Number of vectors in the layer.
- `block`: ID of the first block containing neighbors for this layer.
- `last_block_index`: Relative index of the last neighbors block.
- `last_block_vectors_amount`: Number of entries in the last neighbors block.

## Block Types

### 1. **Vector Header Blocks**
- Contain fixed-size metadata for each vector.
- Each block stores `vector_headers_per_block` vectors.
- To locate a vector's header:
  - **Block ID**: `vector_headers + (id / vector_headers_per_block)`
  - **Offset**: `(id % vector_headers_per_block) * vector_size`

#### `VectorData` Structure
- `base_layer_neighbors_block`: Block ID for base-layer neighbors.
- `base_layer_neighbors_begin`: Start offset of neighbors in the block.
- `base_layer_neighbors_end`: End offset of neighbors in the block.
- `aux_data_block`: Block ID for auxiliary data.
- `aux_data_begin`: Start offset of auxiliary data.
- `aux_data_end`: End offset of auxiliary data.
- `coordinates`: Vector coordinates (fixed per index, but may vary across indexes).

### 2. **Auxiliary Data Blocks**
- Store arbitrary auxiliary data as byte sequences.
- Vector headers specify the byte range for each vector.

### 3. **Non-Base Layer Blocks**
- Contain neighbor lists for vectors in non-base layers.
- To locate neighbors for a vector:
  - **Block ID**: `layer.block + (id / max_vectors_per_non_base_block)`
  - **Index in Block**: `id % max_vectors_per_non_base_block`

#### Structure
- **Finishes section**: Stores end offsets for each vector's neighbor list.
- **Neighbors section**: Contains actual neighbor data.
- For a given vector:
  - **Begin**: `finish` of the previous vector (or `0` for the first vector).
  - **End**: `finish` of the current vector.
- The neighbors section starts after the finishes section.

### 4. **Base Layer Blocks**
- Optimised for faster search (since base-layer operations dominate search time).
- Unlike non-base blocks, these **do not store finishes**—only neighbor lists.
- Vector headers directly specify the neighbors range (`begin` and `end`).

## Performance Comparison with USearch
Average of 4 runs:

### Mac
| Dimensions | Vectors  | Searches  | YbHnsw (s) | USearch (s) | Rate |
|------------|----------|-----------|------------|-------------|------|
| 8          | 2^16     | 2^16      | 5.301      | 6.104       | 0.868|
| 128        | 2^14     | 2^14      | 7.408      | 7.843       | 0.944|
| 512        | 2^12     | 2^12      | 11.764     | 12.042      | 0.977|

### Dev Server (n1-standard-64)
| Dimensions | Vectors  | Searches  | YbHnsw (s) | USearch (s) | Rate    |
|------------|----------|-----------|------------|-------------|---------|
| 8          | 2^16     | 2^16      | 12.525     | 13.591      | 0.922   |
| 128        | 2^14     | 2^14      | 41.578     | 41.751      | 0.999   |
| 512        | 2^12     | 2^12      | 55.119     | 54.548      | 1.01047 |

Performance is nearly identical, but `YbHnsw` offers key advantages:

### Key Benefits
1. **Block-friendly organization**: Enables efficient block caching (e.g., RocksDB-style caching).
2. **Arbitrary auxiliary data storage**: Eliminates the need for a reverse index in RocksDB.
   - RocksDB queries are ~20× slower than vector index searches.
   - Reducing these queries by half could improve PG vector performance by ~45%.

Also added logic to initialise random generator with unique seed in usearch.
Previously the seed was zero for all threads.
As result random number sequence was the same for all insert threads, and generated level had bad distribution in case of multi threaded insert.
Jira: DB-14824

Test Plan: yb_hnsw-test

Reviewers: aleksandr.ponomarenko, arybochkin

Reviewed By: aleksandr.ponomarenko, arybochkin

Subscribers: ybase, yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D40965
Assets 2
Loading