-
Notifications
You must be signed in to change notification settings - Fork 64
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[feat] Add sparse index support to knowhere #341
Conversation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: zhengbuqian The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
for (size_t i = 0; i < rows; ++i) { | ||
raw_data_[current_rows + i] = data[i]; | ||
for (auto [idx, val] : data[i].data) { | ||
if (val < val_threshold) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if drop_ratio_build_ == 0
, then val_threshold
is 0
, so negative val
values will be skipped. Is it by design? This implies that sparse values are expected to be positive only
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Alex! I have changed it so that now we ignore only values that are close enough to zero.
Also the drop strategy has been modified from "drop any value that is below ratio * global max" to "drop the smallest ratio% abs values".
3958730
to
884f2b6
Compare
345c673
to
950a375
Compare
950a375
to
4108734
Compare
include/knowhere/sparse_utils.h
Outdated
|
||
SparseRow& | ||
operator=(const SparseRow& other) { | ||
copy_from(other); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please do all these nasty C++ things like if (this == &other)
here.
Also, typically it is recommended to create a clone object
SparseRow clone(other);
swap(*this, clone);
just in case of an exception during the cloning process
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
} | ||
} | ||
|
||
size_t |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move constructor and move assignment operator please :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
include/knowhere/sparse_utils.h
Outdated
|
||
IdVal<T> | ||
operator[](size_t i) const { | ||
table_t index = *(table_t*)(data + i * (sizeof(table_t) + sizeof(T))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all these raw pointers operations are a nightmare. Please introduce a helper struct, such as
template<typename T>
struct SparseRowElement {
table_t idx;
T value;
};
make sure that it is packed and just use reinterpret_cast<SparseRowElement<T>*>(data)
or reinterpret_cast<const ...*>(data)
for working .
Alternatively, just use IdVal<T>
but make sure that it is packed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, added a private structure.
|
||
// When pushing new elements into a MaxMinHeap, only `capacity` elements with the | ||
// largest val are kept. pop()/top() returns the smallest element out of them. | ||
template <typename T> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we have any similar structures in our code already? I'm just curious, the implementation is fine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see one that matches exactly the use case.
return Status::success; | ||
} | ||
|
||
std::vector<T> vals; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just compute the size that needs to be preallocated, such as
size_t amount = 0;
for (size_t i = 0; i < rows; i++) { amount += data[i].size(); }
std::vector<T> vals(amount);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
if (v < q_threshold || i >= n_cols_internal()) { | ||
continue; | ||
} | ||
for (size_t j = 0; j < inverted_lut_[i].size(); j++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this loop can be improved via SIMD. Maybe, later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great suggestion! I'll leave that as is for now but I left a TODO.
|
||
// LUT supports size() and operator[] which returns an IdVal. | ||
template <typename LUT> | ||
class Cursor { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the design of the iterator is somewhat unusual, but it is fine
void | ||
Search(const SparseRow<T>& query, size_t k, float drop_ratio_search, float* distances, label_t* labels, | ||
size_t refine_factor, const BitsetView& bitset) const { | ||
std::shared_lock<std::shared_mutex> lock(mu_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: you could move the lock of mu_
a little bit down further, because in the following lines you do nothing but std::fill
and not using any of the fields of this struct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
table_t index; | ||
T value; | ||
ElementProxy() = delete; | ||
ElementProxy(const ElementProxy&) = delete; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add ElementProxy& operator=(const ElementProxy&) = delete;
as well
On the other hand, I don't understand why it is needed to delete the constructors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deleting the default and the copy constructor(so no constructor will be generated) should be sufficient. if no instance can ever be made there is no way to call copy assignment operator.
the reason is this element is used solely to help access what's in the raw data_
pointer, we don't actually have the need to create such objects.
src/common/comp/brute_force.cc
Outdated
for (auto& fut : futs) { | ||
fut.wait(); | ||
RETURN_IF_ERROR(fut.result().value()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd use the following to make sure that no futs get unfinished
for (auto& fut : futs) fut.wait();
for (auto& fut : futs) RETURN_IF_ERROR(fut.result().value());
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that is correct. Considering the lambda does not return a Status other than success, I removed the return value and use fut.get()
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are actually other places that made the same mistake, creating another PR to fix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see #355
src/common/comp/brute_force.cc
Outdated
return expected<DataSetPtr>::Err(status, msg); | ||
} | ||
int topk = cfg.k.value(); | ||
auto labels = new sparse::label_t[nq * topk]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A potential memory leak. I'd use something like
auto labels = std::make_unique<sparse::label_t>(nq * topk);
auto distances = std::make_unique<float>(nq * topk);
SearchSparseWithBuf(base_dataset, query_dataset, labels.get(), distances.get(), config, bitset);
return GenResultDataSet(nq, topk, labels.release(), distances.release());
If we're using C++ 17, of course
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch!
auto refine_factor = cfg.refine_factor.value_or(10); | ||
auto drop_ratio_search = cfg.drop_ratio_search.value_or(0.0f); | ||
|
||
auto p_id = new sparse::label_t[nq * k]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same for using std::make_unique<>
as above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
for (auto& fut : futs) { | ||
fut.wait(); | ||
} | ||
return GenResultDataSet(nq, k, p_id, p_dist); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for (auto& fut : futs) RETURN_IF_ERROR(fut.result().value());
is missing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed to fut.get()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added comments to #355 . Basically, it does not work as desired.
b569ba7
to
1ee5dfb
Compare
/hold |
/unhold |
Please change for using |
… added related unit test Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>
1ee5dfb
to
c0f90d3
Compare
/lgtm |
/kind feature
#193
This PR adds the basic framework of supporting sparse index in knowhere, with 2 index types: inverted index and wand.
In this PR only Search and GetVectorByIds is supported.
Added basic unit tests:
Search without filtering, with different dataset sizes/recall/sparisity, etc.
Search with filtering, with different filter rate.