Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feat] Add sparse index support to knowhere #341

Merged
merged 1 commit into from
Jan 24, 2024

Conversation

zhengbuqian
Copy link
Collaborator

/kind feature
#193

This PR adds the basic framework of supporting sparse index in knowhere, with 2 index types: inverted index and wand.

In this PR only Search and GetVectorByIds is supported.

Added basic unit tests:

Search without filtering, with different dataset sizes/recall/sparisity, etc.
Search with filtering, with different filter rate.

@sre-ci-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: zhengbuqian

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

for (size_t i = 0; i < rows; ++i) {
raw_data_[current_rows + i] = data[i];
for (auto [idx, val] : data[i].data) {
if (val < val_threshold) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if drop_ratio_build_ == 0, then val_threshold is 0, so negative val values will be skipped. Is it by design? This implies that sparse values are expected to be positive only

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Alex! I have changed it so that now we ignore only values that are close enough to zero.

Also the drop strategy has been modified from "drop any value that is below ratio * global max" to "drop the smallest ratio% abs values".

@zhengbuqian
Copy link
Collaborator Author

@alexanderguzhva @hhy3 @liliu-z PTAL


SparseRow&
operator=(const SparseRow& other) {
copy_from(other);
Copy link
Collaborator

@alexanderguzhva alexanderguzhva Jan 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do all these nasty C++ things like if (this == &other) here.
Also, typically it is recommended to create a clone object

SparseRow clone(other);
swap(*this, clone);

just in case of an exception during the cloning process

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}
}

size_t
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move constructor and move assignment operator please :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


IdVal<T>
operator[](size_t i) const {
table_t index = *(table_t*)(data + i * (sizeof(table_t) + sizeof(T)));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all these raw pointers operations are a nightmare. Please introduce a helper struct, such as

template<typename T>
struct SparseRowElement {
   table_t idx;
   T value;
};

make sure that it is packed and just use reinterpret_cast<SparseRowElement<T>*>(data) or reinterpret_cast<const ...*>(data) for working .

Alternatively, just use IdVal<T> but make sure that it is packed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, added a private structure.


// When pushing new elements into a MaxMinHeap, only `capacity` elements with the
// largest val are kept. pop()/top() returns the smallest element out of them.
template <typename T>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we have any similar structures in our code already? I'm just curious, the implementation is fine

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see one that matches exactly the use case.

return Status::success;
}

std::vector<T> vals;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just compute the size that needs to be preallocated, such as

size_t amount = 0;
for (size_t i = 0; i < rows; i++) { amount += data[i].size(); }

std::vector<T> vals(amount);

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if (v < q_threshold || i >= n_cols_internal()) {
continue;
}
for (size_t j = 0; j < inverted_lut_[i].size(); j++) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this loop can be improved via SIMD. Maybe, later

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great suggestion! I'll leave that as is for now but I left a TODO.


// LUT supports size() and operator[] which returns an IdVal.
template <typename LUT>
class Cursor {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the design of the iterator is somewhat unusual, but it is fine

void
Search(const SparseRow<T>& query, size_t k, float drop_ratio_search, float* distances, label_t* labels,
size_t refine_factor, const BitsetView& bitset) const {
std::shared_lock<std::shared_mutex> lock(mu_);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: you could move the lock of mu_ a little bit down further, because in the following lines you do nothing but std::fill and not using any of the fields of this struct

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@mergify mergify bot removed the ci-passed label Jan 18, 2024
table_t index;
T value;
ElementProxy() = delete;
ElementProxy(const ElementProxy&) = delete;
Copy link
Collaborator

@alexanderguzhva alexanderguzhva Jan 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add ElementProxy& operator=(const ElementProxy&) = delete; as well
On the other hand, I don't understand why it is needed to delete the constructors

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deleting the default and the copy constructor(so no constructor will be generated) should be sufficient. if no instance can ever be made there is no way to call copy assignment operator.

the reason is this element is used solely to help access what's in the raw data_ pointer, we don't actually have the need to create such objects.

for (auto& fut : futs) {
fut.wait();
RETURN_IF_ERROR(fut.result().value());
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd use the following to make sure that no futs get unfinished

for (auto& fut : futs) fut.wait();
for (auto& fut : futs) RETURN_IF_ERROR(fut.result().value());

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that is correct. Considering the lambda does not return a Status other than success, I removed the return value and use fut.get() instead.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are actually other places that made the same mistake, creating another PR to fix

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see #355

return expected<DataSetPtr>::Err(status, msg);
}
int topk = cfg.k.value();
auto labels = new sparse::label_t[nq * topk];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A potential memory leak. I'd use something like

auto labels = std::make_unique<sparse::label_t>(nq * topk);
auto distances = std::make_unique<float>(nq * topk);

SearchSparseWithBuf(base_dataset, query_dataset, labels.get(), distances.get(), config, bitset);
return GenResultDataSet(nq, topk, labels.release(), distances.release());

If we're using C++ 17, of course

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch!

auto refine_factor = cfg.refine_factor.value_or(10);
auto drop_ratio_search = cfg.drop_ratio_search.value_or(0.0f);

auto p_id = new sparse::label_t[nq * k];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same for using std::make_unique<> as above

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

for (auto& fut : futs) {
fut.wait();
}
return GenResultDataSet(nq, k, p_id, p_dist);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for (auto& fut : futs) RETURN_IF_ERROR(fut.result().value()); is missing

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to fut.get()

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added comments to #355 . Basically, it does not work as desired.

@alexanderguzhva
Copy link
Collaborator

alexanderguzhva commented Jan 19, 2024

/hold
#355-related comments need to be addressed first
but it looks good overall

@alexanderguzhva
Copy link
Collaborator

/unhold
folly's collectAll() works

@alexanderguzhva
Copy link
Collaborator

alexanderguzhva commented Jan 22, 2024

Please change for using collectAll() for futures and I'll lgtm the diff
Thanks

… added related unit test

Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>
@hhy3
Copy link
Collaborator

hhy3 commented Jan 24, 2024

/lgtm

@sre-ci-robot sre-ci-robot merged commit 0f63f2d into zilliztech:main Jan 24, 2024
9 checks passed
@zhengbuqian zhengbuqian deleted the sparse-v2 branch January 24, 2024 06:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants