Narek/external index storage #335
base: main-dev
Conversation
Added some comments to hopefully help in the review process.
```diff
@@ -77,7 +77,7 @@ void test_cosine(index_at& index, std::vector<std::vector<scalar_at>> const& vec
     expect((index.stats(0).nodes == 3));

     // Check if clustering endpoint compiles
-    index.cluster(vector_first, 0, args...);
+    // index.cluster(vector_first, 0, args...);
```
The storage interface has not yet been added for this endpoint, so I removed the test. Will add it back when it is implemented.
```diff
-    auto compaction_result = index.compact();
-    expect(bool(compaction_result));
+    // auto compaction_result = index.compact();
+    // expect(bool(compaction_result));
```
same as above
```cpp
using key_t = std::int64_t;
{
    using slot_t = std::uint32_t;
    using storage_v2_t = storage_v2_at<key_t, slot_t>;
```
Runs tests for the two storage provider APIs:

- `storage_v2_t` is the current storage interface, rearranged into a separate API.
- `std_storage_t` is an example storage provider that demonstrates the API use. It stores all data in `std::` containers and serializes data to disk similarly to usearch v1. It does not do error handling (it asserts on all errors).
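For illustration, a hedged sketch of driving the same test body with both providers, following the pattern in the hunk above (`storage_v2_at` and `std_storage_at` are the provider templates from this PR; the test-body calls are elided):

```cpp
#include <cstdint>

int main() {
    using key_t = std::int64_t;
    {
        using slot_t = std::uint32_t;
        using storage_v2_t = storage_v2_at<key_t, slot_t>; // current interface, factored out
        // ... run the shared test body with storage_v2_t ...
    }
    {
        using slot_t = std::uint32_t;
        using std_storage_t = std_storage_at<key_t, slot_t>; // std:: container example
        // ... run the same test body with std_storage_t ...
    }
    return 0;
}
```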
```cpp
using level_t = std::int16_t;

struct precomputed_constants_t {
    double inverse_log_connectivity{};
    std::size_t neighbors_bytes{};
    std::size_t neighbors_base_bytes{};
};
```
These were moved up from later in this file, so I can refer to them from the node abstract type below. `precomputed_constants_t` is used from storage to figure out the sizes of `node_t` structs. I think it would make sense to split this struct: move `neighbors_*` to storage.hpp and keep `inverse_log_connectivity` as part of `index_gt`.
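A hedged sketch of that proposed split (the struct names below are invented for illustration):

```cpp
// Hypothetical split of precomputed_constants_t, as suggested above.
#include <cstddef>

struct node_sizing_constants_t { // would move to storage.hpp
    std::size_t neighbors_bytes{};      // size of an upper-level neighbors list
    std::size_t neighbors_base_bytes{}; // size of the base-level neighbors list
};

struct index_search_constants_t { // would stay inside index_gt
    double inverse_log_connectivity{};
};
```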
```cpp
 *  then the { `neighbors_count_t`, `compressed_slot_t`, `compressed_slot_t` ... } sequences
 *  for @b each-level.
 */
template <typename key_at, typename slot_at> class node_at {
```
Mostly the same as the `node_t` structure from before. Below are all the changes:

- Moved `static constexpr std::size_t node_head_bytes_()` from a private member of `index_gt` to a public member here, called `node_t::head_size_bytes()`. It was moved out of `index_gt` for global visibility, as it is now used from `storage.hpp`.
- Moved `node_t`-related functions such as `node_bytes_` to be member functions in here, so all `node_t` APIs are grouped together. NOTE: the only node-related API outside of `node_t` now is the neighbor iterator and retriever functions. I can move those here as well, but this diff was already becoming very large, so I postponed that for now. (A sketch of the resulting class follows below.)
- Moved `precompute_` to be a static member here so it can use the template arguments of `node_t`. It used to be a private member in `index_gt`. As already noted, `inverse_log_connectivity` of `precomputed_constants_t` does not really belong here. Happy to address it.
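A rough sketch of the regrouped node API described in this comment (assuming `byte_t`, `level_t`, `index_config_t`, and `precomputed_constants_t` from the surrounding file; bodies elided where the PR defines them):

```cpp
template <typename key_at, typename slot_at> class node_at {
    byte_t* tape_{}; // key, level, then per-level neighbor lists on one tape

  public:
    // Was index_gt's private node_head_bytes_(); public here so storage.hpp can use it:
    static constexpr std::size_t head_size_bytes() noexcept {
        return sizeof(key_at) + sizeof(level_t);
    }
    // Was a helper outside the class (node_bytes_); a member here so node_t APIs stay together:
    std::size_t node_bytes(precomputed_constants_t const& pre) const noexcept;
    // Was index_gt's private precompute_(); static here so it can use node_t's template args:
    static precomputed_constants_t precompute(index_config_t const& config) noexcept;
};
```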
```cpp
other.nodes_count_ = nodes_count_.load();
other.max_level_ = max_level_;
other.entry_slot_ = entry_slot_;
```
copy not implemented
```cpp
 */
template <typename input_callback_at, typename progress_at = dummy_progress_t>
serialization_result_t load_from_stream(input_callback_at&& input, progress_at&& progress = {}) noexcept {

    serialization_result_t result;

    // Remove previously stored objects
    reset();
```
This is done at a higher-level API. We cannot do it here because the higher level in `index_dense` could have already loaded vectors into storage. Calling `reset` on the inner index would call `reset` on storage and wipe out the newly loaded vectors.

This is kind of bad and tricky, and I have not found a better design around it. So far, I think this trickiness is fundamental to usearch v2 storage (separate vectors and nodes), where `index_dense` owns and takes care of vectors, while `index` takes care of nodes. I think this division requires that any storage which stores both have shared ownership between `index_dense` and `index`. Assuming in usearch v3 we move to a format that stores vectors and nodes together, this problem will go away. Open to other suggestions in the meantime.
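A small self-contained toy that mirrors the hazard (all types and names here are invented for illustration):

```cpp
// Toy model of the index_dense/index split described above: both layers share
// one storage object, so the inner layer must not reset it during load.
#include <cassert>
#include <vector>

struct shared_storage_t {
    std::vector<float> vectors; // managed by the outer, "dense" layer
    std::vector<char> nodes;    // managed by the inner index
    void reset() { vectors.clear(); nodes.clear(); }
};

struct inner_index_t {
    shared_storage_t* storage;
    void load_nodes() {
        // storage->reset(); // would wipe the vectors loaded by the outer layer
        storage->nodes.assign(16, 0);
    }
};

int main() {
    shared_storage_t storage;
    inner_index_t inner{&storage};
    storage.reset();                  // done once, at the higher-level API
    storage.vectors.assign(128, 0.f); // outer layer loads vectors first
    inner.load_nodes();               // inner load must not reset shared storage
    assert(!storage.vectors.empty()); // vectors survive
    return 0;
}
```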
```cpp
static_assert( //
    sizeof(typename tape_allocator_traits_t::value_type) == 1, //
    "Tape allocator must allocate separate addressable bytes");
using span_bytes_t = span_gt<byte_t>;
```
`span_bytes_t` is the type for the result of a call to `node_bytes`.
```cpp
// Load metadata and choose the right metric
{
    index_dense_head_buffer_t buffer;
    if (!input(buffer, sizeof(buffer)))
```
`storage_at::load_vectors_from_stream` takes a generic buffer and reads bytes into it from the specified section in the storage buffer, per the storage format spec.
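For orientation, a hedged sketch of what a call site could look like; the exact signature is whatever this PR defines in storage.hpp, so the names and arguments below are assumptions:

```cpp
// Hypothetical call-site sketch: read one section of the serialized storage
// into a caller-provided buffer. Signature and names are assumptions.
index_dense_head_buffer_t buffer;
if (!storage_.load_vectors_from_stream(input, buffer, sizeof(buffer)))
    return result.failed("Failed to read the vectors section");
```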
```diff
@@ -748,11 +750,10 @@ class index_dense_gt {
     unique_lock_t lookup_lock(slot_lookup_mutex_);

     std::unique_lock<std::mutex> free_lock(free_keys_mutex_);
     // storage_ cleared by typed_ todo:: is this confusing?
     typed_->clear();
```
`storage_` is reset by `typed_.reset()`; nothing bad will happen here if I do it again, but this seemed clearer.
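A tiny sketch of that flow, with names taken from the hunk above (sketch only; the real `clear()` does more):

```cpp
// Sketch: clearing the typed index already resets the shared storage_,
// so an explicit storage_.clear() here would be a redundant (if harmless) repeat.
void clear() {
    unique_lock_t lookup_lock(slot_lookup_mutex_);
    std::unique_lock<std::mutex> free_lock(free_keys_mutex_);
    typed_->clear(); // clears the nodes and the shared storage_
}
```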
SIMSIMD, OPENMP and FP16 related CMake options are not properly propagated to compiler header definitions when they are set to non-default values. This commit fixes compile definitions so those values are always propagated properly.

E.g., by default, simsimd usage is turned off, and as we see in the commands below, the correct default `#define`s (i.e. `-DUSEARCH_USE_SIMSIMD=0`) are passed to the compiler:

```
cmake ..
make VERBOSE=1
> cd /home/ngalstyan/lantern/lantern/third_party/usearch/build/cpp && /usr/bin/c++ -DUSEARCH_USE_OPENMP=0 -DUSEARCH_USE_SIMSIMD=0 ... -o CMakeFiles/bench_cpp.dir/bench.cpp.o -c .../bench.cpp
```

But if we try to enable simsimd via CMake for benchmarking and shared C libraries, we do not get the corresponding `-DUSEARCH_USE_SIMSIMD=1` definition:

```
cmake .. -DUSEARCH_USE_SIMSIMD=1
make VERBOSE=1
> cd /home/ngalstyan/lantern/lantern/third_party/usearch/build/cpp && /usr/bin/c++ -DUSEARCH_USE_OPENMP=0 ... -o CMakeFiles/bench_cpp.dir/bench.cpp.o -c .../bench.cpp
```

Note that no definition for `USEARCH_USE_SIMSIMD` was passed to the compiler. Internally, the lack of a simsimd config definition is treated as `-DUSEARCH_USE_SIMSIMD=0` (see [1_simsimd_logic_in_plugins]).

When compiling after adding this commit, we see that we can successfully enable simsimd via the CMake option:

```
cmake .. -DUSEARCH_USE_SIMSIMD=1
make VERBOSE=1
> cd /home/ngalstyan/lantern/lantern/third_party/usearch/build/cpp && /usr/bin/c++ -DUSEARCH_USE_FP16LIB=1 -DUSEARCH_USE_OPENMP=0 -DUSEARCH_USE_SIMSIMD=1 -o CMakeFiles/bench_cpp.dir/bench.cpp.o -c .../bench.cpp
```

[1_simsimd_logic_in_plugins]: https://github.com/unum-cloud/usearch/blob/4747ef42f4140a1fde16118f25f079f9af79649e/include/usearch/index_plugins.hpp#L43-L45
Copied the logic from simsimd. Alternatively, the whole block could be dropped to offload detection to simsimd.
index_plugins configures simsimd, and when simsimd is included before this configuration gets a chance to run during compilation, simsimd.h may be misconfigured. In particular, index_plugins propagates the USEARCH_FP16LIB CMake option as `!SIMSIMD_NATIVE_FP16` (see [1]), and if simsimd.h is included before index_plugins, the wrong value of `SIMSIMD_NATIVE_FP16` may be chosen.

[1]: https://github.com/unum-cloud/usearch/blob/ce54b814a8a10f4c0c32fee7aad9451231b63f75/include/usearch/index_plugins.hpp#L50
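A minimal illustration of the include-order hazard described in this commit message (the orders shown are the point; everything else is illustrative):

```cpp
// Problematic order: simsimd.h configures itself with its own defaults before
// index_plugins.hpp gets a chance to propagate the USEARCH_USE_FP16LIB option
// (which it expresses as !SIMSIMD_NATIVE_FP16):
//
//     #include <simsimd/simsimd.h>         // self-configures too early
//     #include <usearch/index_plugins.hpp> // its SIMSIMD_* settings arrive too late

// Safe order: let index_plugins.hpp set the SIMSIMD_* macros first.
#include <usearch/index_plugins.hpp>
#include <simsimd/simsimd.h>

int main() { return 0; }
```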
- …h is actually used by index.hpp
- passing all functional tests, but there are memory leaks
- …ng it in index_* classes
- … issues under some conditions
Force-pushed from 03ced4c to ec3ed82.
* Add move construction tests and fix an issue caused by them
* Only consider zero-length IO an error if the input buffer was larger than zero
* Move option-override policy opt-in before policy definitions so overrides actually take effect
# [2.9.0](v2.8.16...v2.9.0) (2024-02-22)

### Add
* SQLite binding ([222de55](222de55))
* String distances to SQLite ([ae4d0f0](ae4d0f0))

### Docs
* Header refreshed ([7465c29](7465c29))
* Py and SQLite extensions ([550624b](550624b))
* README.md link to Joins (#327) ([1279c54](1279c54)), closes [#327](#327)

### Fix
* bug reports were immediately marked invalid ([c5fc825](c5fc825))
* Error handling, mem safety bugs #335 (#339) ([4747ef4](4747ef4)), closes [#335](#335) [#339](#339)
* Passing SQLite tests ([6334983](6334983))
* Reported number of levels ([9b1a06a](9b1a06a))
* Skip non-Linux SQLite tests ([b02d262](b02d262))
* SQLite cosine function + tests ([55464fb](55464fb))
* undefined var error in `remove` api ([8d86a9e](8d86a9e))

### Improve
* Multi property lookup ([e8bf02c](e8bf02c))
* Support multi-column vectors ([66f1716](66f1716))

### Make
* `npi ci` (#330) ([5680920](5680920)), closes [#330](#330)
* Add 3.12 wheels ([d66f697](d66f697))
* Change include paths ([21db294](21db294))
* invalid C++17 Clang arg ([2a6d779](2a6d779))
* Link libpthread for older Linux GCC builds (#324) ([6f1e5dd](6f1e5dd)), closes [#324](#324)
* Parallel CI for Python wheels ([a9ad89e](a9ad89e))
* Upgrade SimSIMD & StringZilla ([5481bdf](5481bdf))

### Revert
* Postpone Apache Arrow integration ([5d040ca](5d040ca))
Ashot jan, hi!

tldr;

This is an attempt to add external storage to USearch to help our upgrades at Lantern! It allows swapping the storage format (usearch-v2, usearch-v3, lantern-postgres) without touching the core index structures. As far as I can tell, it does not have a runtime performance impact. Would you be open to merging this kind of interface into upstream usearch, or should we maintain it outside?

We have been using a fork of USearch with this kind of external storage for about half a year at Lantern. This is an attempt to upstream it. We have chatted about this before, so some of the stuff below may be repetition, but I am putting it here for completeness.
Motivation
Currently, the core high-performance implementation of vector search is woven through the storage, serialization, and file-IO interfaces. This makes it harder to:

1. implement custom storage backends, and
2. control the memory footprint of the index (when `m` in hnsw is large, neighbor lists become a significant portion of index memory footprint).

One might argue that (1) can be achieved by passing a custom allocator to `index_gt` or `index_dense_gt`. This has limitations and did not work for us for two reasons:

1. The allocated memory's lifetime is bound to `index_gt`. In Lantern, we are dealing with a persistent index - all changes are saved to postgres data files and replicated if needed. So, the index memory needs to outlive any usearch data structures.
2. An allocator has no control over the relative placement of allocations (node `i`, node `i+1`, etc).

The storage interface proposed here helps us achieve the goals above.
Design
This PR adds a `storage_at` template parameter to the usearch index types. The exact storage layout is opaque to the rest of usearch - all serialization/deserialization logic lives in `storage_at`, so new storage formats can be implemented without touching the rest of the code.

As an example, I implemented a new storage provider in `std_storage.hpp` that uses C++ standard library containers and stores nodes and vectors adjacent to each other when serializing to a file (similar to the usearch v1 format, but this one adds padding between the node tape and the vector tape in serialization to make sure `view()` does not result in unaligned memory accesses).

The Storage API
I designed the storage API around how the current usearch v2 storage works. I tried to minimize the amount of changes in `index.hpp` and `index_dense.hpp` to hopefully make reviewing easier. I think the storage interface can be simplified and improved in many ways, especially after a usearch v3 format transition. I am open to changing the full API, so long as there is some kind of storage API.

NOTE: There is no new logic in this PR. Most of it is just factoring out storage-related interfaces and functions into the separate header.
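For orientation, the rough shape this gives the index types (a sketch; the actual parameter list and defaults are in the PR):

```cpp
// Sketch: the index is parameterized over an opaque storage provider.
template <typename key_at,             //
          typename compressed_slot_at, //
          typename storage_at>         // e.g. storage_v2_at<key_at, compressed_slot_at>
class index_gt {
    storage_at storage_; // owns nodes, locks, and all (de)serialization logic
    // ... search/insert logic stays storage-format agnostic ...
};
```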
The storage API is defined at the beginning of `storage.hpp` and implemented by several storage backends. `index_gt` and `index_dense_gt` were modified to use this storage API. I added a helper type-enforcer macro that runs compile-time checks to make sure a provided type meets the necessary interface requirements to be a usearch storage provider.
Next?
This has some rough edges, most of which should be listed below. I will come back and update this if more things come up.
Before putting time into those, however, I just wanted to see whether you would be open to merging this into mainline usearch. This would help us at Lantern a lot and would be a big step towards upstream-usearch compatibility for us.
We will likely start using a simplified version of this API at Lantern soon, so we can report back on how well it works for our case.
TODOs
- `index.hpp`
- `progress&` (I copied the API from current usearch and some APIs there are not taking `progress&`)