VectorDB (from Scratch with HNSW)

VectorDB is a vector database implementation written in C++ that supports approximate nearest neighbor (ANN) search using the Hierarchical Navigable Small World (HNSW) algorithm. It is designed from the ground up with no third-party ANN libraries, to provide a clean, educational, and extensible implementation of modern ANN techniques.

Features

Vector storage: stores dense vectors of arbitrary dimensionality (e.g., 2D, 3D, or higher).
HNSW graph construction:
- Random level assignment for vertices (pyramid structure).
- Greedy descent through higher layers.
- Layer 0 neighbor collection (efConstruction).
- Bidirectional linking with duplicate/self-loop checks.
Approximate nearest neighbor search:
- Greedy descent from top layer.
- Candidate/result heaps at layer 0.
- Tunable efSearch parameter for recall vs. speed.
Pointer-free, index-based design for fast access and portability.
CRUD API:
- Insert vectors by ID
- Update existing vectors
- Remove vectors
- Retrieve vectors
Fully written in modern C++ (C++17), no external dependencies.

Background

What is a Vector Database?

A vector database stores high-dimensional numerical vectors (embeddings) and provides efficient nearest neighbor queries.
These are widely used in:

Recommendation systems
Natural Language Processing (semantic search, RAG pipelines)
Computer Vision (image similarity, clustering)
Robotics and navigation

What is HNSW?

HNSW (Hierarchical Navigable Small World) is one of the most powerful algorithms for Approximate Nearest Neighbor (ANN) search.
It builds a multi-layer graph where:

Layer 0 contains all vectors.
Upper layers contain progressively sparser “express lanes.”
Search starts at the top, descends greedily, and finishes at layer 0 using candidate expansion.

This structure balances speed and recall, and scales to millions of vectors.

🔍 Design Note: Heap Strategy in `collect_vertices`

In the original HNSW paper by Malkov & Yashunin (2018), the neighbor collection at layer 0 was implemented using two separate heaps:

A candidate priority queue (min-heap by distance).
A result set (max-heap, bounded by efConstruction).

Algorithm (Malkov’s design):

Push the entry point into both heaps.
Pop the closest element from the candidate heap.
Explore its neighbors.
If a neighbor is closer than the farthest in the result set, insert it into both heaps.
Repeat until the candidate queue is empty or stable.

This works well in practice (especially with larger efConstruction), but has a subtle limitation:
a candidate may be pushed into the result set before its closer neighbors are discovered, leading to slightly “noisy” results.

Current Approach (Improved Heap Design)

In this implementation, I simplify the process by using a single min-heap owned by the class:

When an element is popped, its neighbors are explored and pushed if valid.
The popped element is then reinserted into the same heap.
This ensures that the heap always reflects the true closest elements at every step.

Advantages:

Eliminates the need for a separate result heap.
Prevents premature acceptance of suboptimal candidates.
Keeps the top of the heap always aligned with the best current neighbor.

Effectively, this turns neighbor collection into a continuous best-first search, making the algorithm both simpler and conceptually cleaner, while maintaining the same guarantees as the original design.

Usage

Inserting vectors

VectorDB vg;

// generate random 3D vectors in [0, 1]
vector<float> t1 = vg.vector_generator_3D(0.0f, 1.0f);
vg.insert_hnsw("coco", t1);

vector<float> t2 = vg.vector_generator_3D(0.0f, 1.0f);
vg.insert_hnsw("toto", t2);

vector<float> t3 = vg.vector_generator_3D(0.0f, 1.0f);
vg.insert_hnsw("lulu", t3);

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
inc		inc
Hnsw		Hnsw
Hnsw.cpp		Hnsw.cpp
LICENSE		LICENSE
README.md		README.md
Vertex.h		Vertex.h

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

VectorDB (from Scratch with HNSW)

Features

Background

What is a Vector Database?

What is HNSW?

🔍 Design Note: Heap Strategy in `collect_vertices`

Current Approach (Improved Heap Design)

Usage

Inserting vectors

About

Uh oh!

Releases

Packages

Languages

License

tslime/VectorDB

Folders and files

Latest commit

History

Repository files navigation

VectorDB (from Scratch with HNSW)

Features

Background

What is a Vector Database?

What is HNSW?

🔍 Design Note: Heap Strategy in collect_vertices

Current Approach (Improved Heap Design)

Usage

Inserting vectors

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

🔍 Design Note: Heap Strategy in `collect_vertices`

Packages