Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feat] Add sparse index support to knowhere #199

Closed
wants to merge 5 commits into from

Conversation

zhengbuqian
Copy link
Collaborator

@zhengbuqian zhengbuqian commented Nov 17, 2023

/kind feature
#193

This PR adds the basic framework of supporting sparse index in knowhere, with 2 index types: inverted index and wand.

In this PR only Search is supported.

Support for growing segment is sub-optimal, we naively double the underlying data structure each time we used up the current space.

Added basic unit tests:

  • Search without filtering, with different dataset sizes/recall/sparisity, etc.
  • Search with filtering, with different filter rate.

@sre-ci-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: zhengbuqian

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

… added related unit test

Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>
Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>
}
table_t
pop() {
std::pop_heap(pool_.begin(), pool_.begin() + size_--);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please

std::pop_heap(pool_.begin(), pool_.begin() + size_);
size_ -= 1;

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

dist += it->second * data[k];
}
}
if (dist > 0) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the code does not distinguish two very different situations:

  1. dist=0 because no item was found in query
  2. dist=0 because the match is exact

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only IP is supported thus dist=0 doesn't mean exact match?

int result_size = heap.size();
for (int64_t j = result_size - 1; j >= 0; --j) {
cur_labels[j] = heap.top().id;
cur_distances[j] = -heap.top().distance;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

working with negative distances twice... I'd rather change MinMaxHeap implementation

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}
std::sort(q_vec.begin(), q_vec.end(),
[](const auto& lhs, const auto& rhs) { return std::abs(lhs.second) > std::abs(rhs.second); });
while (q_vec.size() && q_vec[0].second * drop_ratio_search > q_vec.back().second) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while (q_vec.size() && ...) -> while (!q_vec.empty() && ...)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


private:
[[nodiscard]] float
dot_product(std::unordered_map<IndicesT, T> q_map, table_t u) const {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dot_product(const std::unordered_map<IndicesT, T>& q_map, table_t u) const {

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}

void
refine_and_collect(std::unordered_map<IndicesT, T>& q_map, MinMaxHeap<T> inaccurate, size_t k, float* distances,
Copy link
Collaborator

@alexanderguzhva alexanderguzhva Nov 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

void refine_and_collect(const std::unordered_map<IndicesT, T>& q_map, MinMaxHeap<T>& inaccurate, size_t k, float* distances, label_t* labels) const

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

if (refine_factor == 1) {
collect_result(heap, distances, labels);
} else {
std::unordered_map<IndicesT, T> q_map;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the size of the q_map is known beforehand (it is len). So, please construct q_map with the provided number of buckets (I'd try 4 * len as a start), so

std::unordered_map<IndicesT, T> q_map(4 * len);

The number of buckets impacts the performance of the following dot_product()

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or I can tweak the performance as soon as this commit lands, there are many things to play with there :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! I am using 4*len as a start and leaving a TODO. Focusing on the functionality for this PR.


template <typename HeapType>
void
collect_result(HeapType heap, float* distances, label_t* labels) const {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HeapType& heap

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}
}
Cursor(const Cursor& rhs) = delete;
Cursor(Cursor&& rhs) noexcept {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't Cursor(Cursor&& rhs) noexcept = default; work?
Same question for Cursor& operator=(Cursor&& rhs) noexcept

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am changing lut_ to be a const ref member, move constructor, move assignment, copy assignment will now be deleted, and I am still explicitly deleting the copy constructor as we don't currently have a usage.

swap(lhs.q_value_, rhs.q_value_);
// all cursors share the same bitset
}
const std::vector<Neighbor<T>>* lut_;
Copy link
Collaborator

@alexanderguzhva alexanderguzhva Nov 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const std::vector<Neighbor<T>>* lut_ = nullptr;
Also, using vector<>* leads to double pointer dereferencing, unless the compiler optimizes this away

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! I am removing the swap method and changing lut_ to const std::vector<Neighbor<T>>&.

size_t num_vec_ = 0;
float max_score_ = 0.0f;
float q_value_ = 0.0f;
const BitsetView* bitset_;
Copy link
Collaborator

@alexanderguzhva alexanderguzhva Nov 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BitsetView is a lightweight structure (a wrapper around the pointer, basically) that should be passed by value.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

…to const ref; update multiple InvertedIndex member functions to accept reference type for container types; addressed other review comments

Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>
@alexanderguzhva
Copy link
Collaborator

the functionality seems to be fine, but what about the unit tests running time, is it expected?


private:
[[nodiscard]] float
dot_product(const std::unordered_map<IndicesT, T>& q_map, table_t u) const {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::unordered_map seems with a bad performance as I tested. It's better to use two arrays indices and data

…compute refined distance.

Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>
@zhengbuqian
Copy link
Collaborator Author

the functionality seems to be fine, but what about the unit tests running time, is it expected?

I added some large unit tsets so ut took quite long, I have removed them. Will revert the change in ut yaml before merge.

Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>
@mergify mergify bot removed the ci-passed label Nov 22, 2023
@zhengbuqian
Copy link
Collaborator Author

/run-e2e

@zhengbuqian
Copy link
Collaborator Author

run-e2e

@mergify mergify bot added ci-passed and removed ci-passed labels Nov 22, 2023
@alexanderguzhva
Copy link
Collaborator

/lgtm overall

@yellow-shine
Copy link
Collaborator

/run-e2e

1 similar comment
@yellow-shine
Copy link
Collaborator

/run-e2e

@mergify mergify bot added the ci-passed label Nov 23, 2023
@zhengbuqian
Copy link
Collaborator Author

/hold

@zhengbuqian
Copy link
Collaborator Author

Use #341 instead

@zhengbuqian zhengbuqian deleted the sparse-v1 branch March 14, 2024 08:03
@mergify mergify bot added needs-dco and removed dco-passed labels Apr 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants