Narek/external index storage #335
base: main-dev
Conversation
Added some comments to hopefully help in the review process.
```diff
@@ -77,7 +77,7 @@ void test_cosine(index_at& index, std::vector<std::vector<scalar_at>> const& vec
     expect((index.stats(0).nodes == 3));

     // Check if clustering endpoint compiles
-    index.cluster(vector_first, 0, args...);
+    // index.cluster(vector_first, 0, args...);
```
The storage interface has not yet been added for this endpoint, so I removed the test. Will add it back when it is implemented.
```diff
-    auto compaction_result = index.compact();
-    expect(bool(compaction_result));
+    // auto compaction_result = index.compact();
+    // expect(bool(compaction_result));
```
same as above
```cpp
using key_t = std::int64_t;
{
    using slot_t = std::uint32_t;
    using storage_v2_t = storage_v2_at<key_t, slot_t>;
```
Runs tests for the two storage provider APIs:

- `storage_v2_t` is the current storage interface, rearranged into a separate API.
- `std_storage_t` is an example storage provider that demonstrates the API use. It stores all data in `std::` containers and serializes data to disk similarly to usearch v1. It does not do error handling (it asserts on all errors).
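For illustration, a hedged sketch of driving the same test body with both providers, following the pattern in the hunk above (`storage_v2_at` and `std_storage_at` are the provider templates from this PR; the test-body calls are elided):

```cpp
#include <cstdint>

int main() {
    using key_t = std::int64_t;
    {
        using slot_t = std::uint32_t;
        using storage_v2_t = storage_v2_at<key_t, slot_t>; // current interface, factored out
        // ... run the shared test body with storage_v2_t ...
    }
    {
        using slot_t = std::uint32_t;
        using std_storage_t = std_storage_at<key_t, slot_t>; // std:: container example
        // ... run the same test body with std_storage_t ...
    }
    return 0;
}
```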
```cpp
using level_t = std::int16_t;

struct precomputed_constants_t {
    double inverse_log_connectivity{};
    std::size_t neighbors_bytes{};
    std::size_t neighbors_base_bytes{};
};
```
These were moved up from later in this file, so I can refer to them from the node abstract type below. `precomputed_constants_t` is used from storage to figure out the sizes of `node_t` structs. I think it would make sense to split this struct: move `neighbors_*` to storage.hpp and keep `inverse_log_connectivity` as part of `index_gt`.
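A hedged sketch of that proposed split (the struct names below are invented for illustration):

```cpp
// Hypothetical split of precomputed_constants_t, as suggested above.
#include <cstddef>

struct node_sizing_constants_t { // would move to storage.hpp
    std::size_t neighbors_bytes{};      // size of an upper-level neighbors list
    std::size_t neighbors_base_bytes{}; // size of the base-level neighbors list
};

struct index_search_constants_t { // would stay inside index_gt
    double inverse_log_connectivity{};
};
```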
```cpp
 *  then the { `neighbors_count_t`, `compressed_slot_t`, `compressed_slot_t` ... } sequences
 *  for @b each-level.
 */
template <typename key_at, typename slot_at> class node_at {
```
Mostly the same as the `node_t` structure from before. Below are all the changes:

- Moved `static constexpr std::size_t node_head_bytes_()` from a private member of `index_gt` to a public member here, called `node_t::head_size_bytes()`. It was moved out of `index_gt` for global visibility, as it is now used from `storage.hpp`.
- Moved `node_t`-related functions such as `node_bytes_` to be member functions in here, so all `node_t` APIs are grouped together. NOTE: the only node-related API outside of `node_t` now is the neighbor iterator and retriever functions. I can move those here as well, but this diff was already becoming very large, so I postponed that for now. (A sketch of the resulting class follows below.)
- Moved `precompute_` to be a static member here so it can use the template arguments of `node_t`. It used to be a private member in `index_gt`. As already noted, `inverse_log_connectivity` of `precomputed_constants_t` does not really belong here. Happy to address it.
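A rough sketch of the regrouped node API described in this comment (assuming `byte_t`, `level_t`, `index_config_t`, and `precomputed_constants_t` from the surrounding file; bodies elided where the PR defines them):

```cpp
template <typename key_at, typename slot_at> class node_at {
    byte_t* tape_{}; // key, level, then per-level neighbor lists on one tape

  public:
    // Was index_gt's private node_head_bytes_(); public here so storage.hpp can use it:
    static constexpr std::size_t head_size_bytes() noexcept {
        return sizeof(key_at) + sizeof(level_t);
    }
    // Was a helper outside the class (node_bytes_); a member here so node_t APIs stay together:
    std::size_t node_bytes(precomputed_constants_t const& pre) const noexcept;
    // Was index_gt's private precompute_(); static here so it can use node_t's template args:
    static precomputed_constants_t precompute(index_config_t const& config) noexcept;
};
```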
```cpp
other.nodes_count_ = nodes_count_.load();
other.max_level_ = max_level_;
other.entry_slot_ = entry_slot_;
```
copy not implemented
```cpp
 */
template <typename input_callback_at, typename progress_at = dummy_progress_t>
serialization_result_t load_from_stream(input_callback_at&& input, progress_at&& progress = {}) noexcept {

    serialization_result_t result;

    // Remove previously stored objects
    reset();
```
This is done at a higher-level API. We cannot do it here because the higher level in `index_dense` could have already loaded vectors into storage. Calling `reset` on the inner index would call `reset` on storage and wipe out the newly loaded vectors.

This is kind of bad and tricky, and I have not found a better design around it. So far, I think this trickiness is fundamental to usearch v2 storage (separate vectors and nodes), where `index_dense` owns and takes care of vectors, while `index` takes care of nodes. I think this division requires that any storage which stores both have shared ownership between `index_dense` and `index`. Assuming in usearch v3 we move to a format that stores vectors and nodes together, this problem will go away. Open to other suggestions in the meantime.
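A small self-contained toy that mirrors the hazard (all types and names here are invented for illustration):

```cpp
// Toy model of the index_dense/index split described above: both layers share
// one storage object, so the inner layer must not reset it during load.
#include <cassert>
#include <vector>

struct shared_storage_t {
    std::vector<float> vectors; // managed by the outer, "dense" layer
    std::vector<char> nodes;    // managed by the inner index
    void reset() { vectors.clear(); nodes.clear(); }
};

struct inner_index_t {
    shared_storage_t* storage;
    void load_nodes() {
        // storage->reset(); // would wipe the vectors loaded by the outer layer
        storage->nodes.assign(16, 0);
    }
};

int main() {
    shared_storage_t storage;
    inner_index_t inner{&storage};
    storage.reset();                  // done once, at the higher-level API
    storage.vectors.assign(128, 0.f); // outer layer loads vectors first
    inner.load_nodes();               // inner load must not reset shared storage
    assert(!storage.vectors.empty()); // vectors survive
    return 0;
}
```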
```cpp
static_assert( //
    sizeof(typename tape_allocator_traits_t::value_type) == 1, //
    "Tape allocator must allocate separate addressable bytes");
using span_bytes_t = span_gt<byte_t>;
```
`span_bytes_t` is the type for the result of a call to `node_bytes`.
```cpp
// Load metadata and choose the right metric
{
    index_dense_head_buffer_t buffer;
    if (!input(buffer, sizeof(buffer)))
```
`storage_at::load_vectors_from_stream` takes a generic buffer and reads bytes into it from the specified section in the storage buffer, per the storage format spec.
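For orientation, a hedged sketch of what a call site could look like; the exact signature is whatever this PR defines in storage.hpp, so the names and arguments below are assumptions:

```cpp
// Hypothetical call-site sketch: read one section of the serialized storage
// into a caller-provided buffer. Signature and names are assumptions.
index_dense_head_buffer_t buffer;
if (!storage_.load_vectors_from_stream(input, buffer, sizeof(buffer)))
    return result.failed("Failed to read the vectors section");
```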
```diff
@@ -748,11 +750,10 @@ class index_dense_gt {
     unique_lock_t lookup_lock(slot_lookup_mutex_);

     std::unique_lock<std::mutex> free_lock(free_keys_mutex_);
     // storage_ cleared by typed_ todo:: is this confusing?
     typed_->clear();
```
`storage_` is reset by `typed_.reset()`; nothing bad will happen here if I do it again, but this seemed clearer.
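A tiny sketch of that flow, with names taken from the hunk above (sketch only; the real `clear()` does more):

```cpp
// Sketch: clearing the typed index already resets the shared storage_,
// so an explicit storage_.clear() here would be a redundant (if harmless) repeat.
void clear() {
    unique_lock_t lookup_lock(slot_lookup_mutex_);
    std::unique_lock<std::mutex> free_lock(free_keys_mutex_);
    typed_->clear(); // clears the nodes and the shared storage_
}
```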
SIMSIMD, OPENMP and FP16 related CMake options are not properly propagated to compiler header definitions when they are set to non-default values. This commit fixes compile definitions so those values are always propagated properly.

E.g., by default, simsimd usage is turned off, and as we see in the commands below, the correct default `#define`s (i.e. `-DUSEARCH_USE_SIMSIMD=0`) are passed to the compiler:

```
cmake ..
make VERBOSE=1
> cd /home/ngalstyan/lantern/lantern/third_party/usearch/build/cpp && /usr/bin/c++ -DUSEARCH_USE_OPENMP=0 -DUSEARCH_USE_SIMSIMD=0 ... -o CMakeFiles/bench_cpp.dir/bench.cpp.o -c .../bench.cpp
```

But if we try to enable simsimd via CMake for benchmarking and shared C libraries, we do not get the corresponding `-DUSEARCH_USE_SIMSIMD=1` definition:

```
cmake .. -DUSEARCH_USE_SIMSIMD=1
make VERBOSE=1
> cd /home/ngalstyan/lantern/lantern/third_party/usearch/build/cpp && /usr/bin/c++ -DUSEARCH_USE_OPENMP=0 ... -o CMakeFiles/bench_cpp.dir/bench.cpp.o -c .../bench.cpp
```

Note that no definition for `USEARCH_USE_SIMSIMD` was passed to the compiler. Internally, the lack of a simsimd config definition is treated as `-DUSEARCH_USE_SIMSIMD=0` (see [1_simsimd_logic_in_plugins]).

When compiling after adding this commit, we see that we can successfully enable simsimd via the CMake option:

```
cmake .. -DUSEARCH_USE_SIMSIMD=1
make VERBOSE=1
> cd /home/ngalstyan/lantern/lantern/third_party/usearch/build/cpp && /usr/bin/c++ -DUSEARCH_USE_FP16LIB=1 -DUSEARCH_USE_OPENMP=0 -DUSEARCH_USE_SIMSIMD=1 -o CMakeFiles/bench_cpp.dir/bench.cpp.o -c .../bench.cpp
```

[1_simsimd_logic_in_plugins]: https://github.com/unum-cloud/usearch/blob/4747ef42f4140a1fde16118f25f079f9af79649e/include/usearch/index_plugins.hpp#L43-L45
Copied the logic from simsimd. Alternatively, the whole block could be dropped to offload detection to simsimd.
index_plugins configures simsimd, and when simsimd is included before this configuration gets a chance to run during compilation, simsimd.h may be misconfigured. In particular, index_plugins propagates the USEARCH_FP16LIB CMake option as `!SIMSIMD_NATIVE_FP16` (see [1]), and if simsimd.h is included before index_plugins, the wrong value of `SIMSIMD_NATIVE_FP16` may be chosen.

[1]: https://github.com/unum-cloud/usearch/blob/ce54b814a8a10f4c0c32fee7aad9451231b63f75/include/usearch/index_plugins.hpp#L50
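A minimal illustration of the include-order hazard described in this commit message (the orders shown are the point; everything else is illustrative):

```cpp
// Problematic order: simsimd.h configures itself with its own defaults before
// index_plugins.hpp gets a chance to propagate the USEARCH_USE_FP16LIB option
// (which it expresses as !SIMSIMD_NATIVE_FP16):
//
//     #include <simsimd/simsimd.h>         // self-configures too early
//     #include <usearch/index_plugins.hpp> // its SIMSIMD_* settings arrive too late

// Safe order: let index_plugins.hpp set the SIMSIMD_* macros first.
#include <usearch/index_plugins.hpp>
#include <simsimd/simsimd.h>

int main() { return 0; }
```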
- …h is actually used by index.hpp
- passing all functional tests, but there are memory leaks
- …ng it in index_* classes
- … issues under some conditions
Force-pushed from 03ced4c to ec3ed82.
* Add move construction tests and fix an issue caused by them
* Only consider zero-length IO an error if the input buffer was larger than zero
* Move option-override policy opt-in before policy definitions so overrides actually take effect
# [2.9.0](v2.8.16...v2.9.0) (2024-02-22)

### Add
* SQLite binding ([222de55](222de55))
* String distances to SQLite ([ae4d0f0](ae4d0f0))

### Docs
* Header refreshed ([7465c29](7465c29))
* Py and SQLite extensions ([550624b](550624b))
* README.md link to Joins (#327) ([1279c54](1279c54)), closes [#327](#327)

### Fix
* bug reports were immediately marked invalid ([c5fc825](c5fc825))
* Error handling, mem safety bugs #335 (#339) ([4747ef4](4747ef4)), closes [#335](#335) [#339](#339)
* Passing SQLite tests ([6334983](6334983))
* Reported number of levels ([9b1a06a](9b1a06a))
* Skip non-Linux SQLite tests ([b02d262](b02d262))
* SQLite cosine function + tests ([55464fb](55464fb))
* undefined var error in `remove` api ([8d86a9e](8d86a9e))

### Improve
* Multi property lookup ([e8bf02c](e8bf02c))
* Support multi-column vectors ([66f1716](66f1716))

### Make
* `npi ci` (#330) ([5680920](5680920)), closes [#330](#330)
* Add 3.12 wheels ([d66f697](d66f697))
* Change include paths ([21db294](21db294))
* invalid C++17 Clang arg ([2a6d779](2a6d779))
* Link libpthread for older Linux GCC builds (#324) ([6f1e5dd](6f1e5dd)), closes [#324](#324)
* Parallel CI for Python wheels ([a9ad89e](a9ad89e))
* Upgrade SimSIMD & StringZilla ([5481bdf](5481bdf))

### Revert
* Postpone Apache Arrow integration ([5d040ca](5d040ca))
Ashot jan, hi!

tldr;

This is an attempt to add external storage to USearch to help our upgrades at Lantern! It allows swapping the storage format (usearch-v2, usearch-v3, lantern-postgres) without touching the core index structures. As far as I can tell, it does not have a runtime performance impact. Would you be open to merging this kind of interface into upstream usearch, or should we maintain it outside?

We have been using a fork of USearch with this kind of external storage for about half a year at Lantern. This is an attempt to upstream it. We have chatted about this before, so some of the stuff below may be repetition, but I am putting it here for completeness.
Motivation
Currently, the core high-performance implementation of vector search is woven through the storage, serialization, and file-IO interfaces. This makes it harder to:

1. implement custom storage backends, and
2. control the memory footprint of the index (when `m` in hnsw is large, neighbor lists become a significant portion of index memory footprint).

One might argue that (1) can be achieved by passing a custom allocator to `index_gt` or `index_dense_gt`. This has limitations and did not work for us for two reasons:

1. The allocated memory's lifetime is bound to `index_gt`. In Lantern, we are dealing with a persistent index - all changes are saved to postgres data files and replicated if needed. So, the index memory needs to outlive any usearch data structures.
2. An allocator has no control over the relative placement of allocations (node `i`, node `i+1`, etc).

The storage interface proposed here helps us achieve the goals above.
Design
This PR adds a `storage_at` template parameter to the usearch index types. The exact storage layout is opaque to the rest of usearch - all serialization/deserialization logic lives in `storage_at`, so new storage formats can be implemented without touching the rest of the code.

As an example, I implemented a new storage provider in `std_storage.hpp` that uses C++ standard library containers and stores nodes and vectors adjacent to each other when serializing to a file (similar to the usearch v1 format, but this one adds padding between the node tape and the vector tape in serialization to make sure `view()` does not result in unaligned memory accesses).

The Storage API
I designed the storage API around how the current usearch v2 storage works. I tried to minimize the amount of changes in `index.hpp` and `index_dense.hpp` to hopefully make reviewing easier. I think the storage interface can be simplified and improved in many ways, especially after a usearch v3 format transition. I am open to changing the full API, so long as there is some kind of storage API.

NOTE: There is no new logic in this PR. Most of it is just factoring out storage-related interfaces and functions into the separate header.
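For orientation, the rough shape this gives the index types (a sketch; the actual parameter list and defaults are in the PR):

```cpp
// Sketch: the index is parameterized over an opaque storage provider.
template <typename key_at,             //
          typename compressed_slot_at, //
          typename storage_at>         // e.g. storage_v2_at<key_at, compressed_slot_at>
class index_gt {
    storage_at storage_; // owns nodes, locks, and all (de)serialization logic
    // ... search/insert logic stays storage-format agnostic ...
};
```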
The storage API is defined at the beginning of `storage.hpp` and implemented by several storage backends. `index_gt` and `index_dense_gt` were modified to use this storage API. I added a helper type-enforcer macro that runs compile-time checks to make sure a provided type meets the necessary interface requirements to be a usearch storage provider.
Next?
This has some rough edges, most of which should be listed below. I will come back and update this if more things come up.
Before putting time into those, however, I just wanted to see whether you would be open to merging this into mainline usearch. This would help us at Lantern a lot and would be a big step towards upstream-usearch compatibility for us.
We will likely start using a simplified version of this API at Lantern soon, so we can report back on how well it works for our case.
TODOs
- `index.hpp`
- `progress&` (I copied the API from current usearch and some APIs there are not taking `progress&`)