
Top k Collector with Linear time selection #298

Closed
kcm1700 opened this issue May 11, 2018 · 10 comments

@kcm1700

kcm1700 commented May 11, 2018

From #283 (comment)

About the time complexity, using linear time selection for each segment has slightly better time complexity in theory.

Linear time selection in each segment works like this. Instead of a heap of size k, create an array of capacity 2k. Push each hit onto the array. When the number of docs in the array reaches 2k, perform linear selection to find the top k docs and drop the rest. Continue this way. The final result is the top k docs in arbitrary order.
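A minimal sketch of the per-segment collector, using hypothetical names (`TopK`, `collect`, `into_sorted`) and plain `f32` scores, assuming k ≥ 1 and arbitrary tie-breaking. `select_nth_unstable_by` from the standard library plays the role of the linear selection step:

```rust
/// Hypothetical top-k collector backed by a Vec of capacity 2k.
struct TopK {
    k: usize,
    buf: Vec<(f32, u32)>, // (score, doc id)
}

impl TopK {
    fn new(k: usize) -> Self {
        TopK { k, buf: Vec::with_capacity(2 * k) }
    }

    /// Amortized O(1): the selection pass is O(2k) and fires
    /// at most once per k pushes.
    fn collect(&mut self, doc: u32, score: f32) {
        self.buf.push((score, doc));
        if self.buf.len() == 2 * self.k {
            // Partition so that indices 0..k hold the k best scores
            // (descending comparator), in linear time on average.
            self.buf.select_nth_unstable_by(self.k - 1, |a, b| {
                b.0.partial_cmp(&a.0).unwrap()
            });
            self.buf.truncate(self.k); // drop the rest
        }
    }

    /// O(k log k): sort the survivors by descending score.
    fn into_sorted(mut self) -> Vec<(f32, u32)> {
        self.buf.sort_by(|a, b| b.0.partial_cmp(&a.0).unwrap());
        self.buf.truncate(self.k);
        self.buf
    }
}
```

Note the survivors between selection passes sit in arbitrary order; only `into_sorted` pays the O(k log k) sort at the end.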

In the merge phase, use a heap to get the docs in sorted order: O(k log(kS)) heap insertions, where S is the number of segments.
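The merge phase could be sketched as below (hypothetical `merge_top_k`; integer scores are used because `f32` is not `Ord`). `BinaryHeap::from` heapifies all kS candidates in O(kS), and each of the k pops then costs O(log(kS)):

```rust
use std::collections::BinaryHeap;

/// Merge the unsorted per-segment top-k lists into a globally
/// sorted top k of (score, doc) pairs.
fn merge_top_k(segments: Vec<Vec<(u32, u32)>>, k: usize) -> Vec<(u32, u32)> {
    // Heapify all candidates at once: O(kS).
    let all: Vec<(u32, u32)> = segments.into_iter().flatten().collect();
    let mut heap = BinaryHeap::from(all);
    // Pop the k best, each pop costing O(log(kS)).
    let mut out = Vec::with_capacity(k);
    for _ in 0..k {
        match heap.pop() {
            Some(hit) => out.push(hit),
            None => break,
        }
    }
    out
}
```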

Let's experiment with the idea. The algorithm might work better for large k.

Time Complexity

If capacity = 2k, collect() runs in amortized O(1) and score_docs() runs in O(k log k).

Of course we can try different capacities, but it is not that interesting. If capacity = (1+α)k, linear time selection happens at most (n − (1+α)k) / (αk) times, collect() runs in amortized O(1 + α⁻¹), and score_docs() runs in O(k log k).
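A quick way to sanity-check the round count is to simulate the buffer length directly (hypothetical `count_selections`; `cap` stands for (1+α)k). Each selection pass resets the buffer from `cap` back down to k, so one pass fires for every cap − k = αk new hits:

```rust
/// Count how many linear-selection passes fire for n hits
/// with a buffer of capacity `cap` that keeps the top `k`.
fn count_selections(n: usize, k: usize, cap: usize) -> usize {
    let mut len = 0; // current buffer length
    let mut rounds = 0;
    for _ in 0..n {
        len += 1;
        if len == cap {
            rounds += 1;
            len = k; // the selection pass keeps only the top k
        }
    }
    rounds
}
```

With n = 1000 and k = 100, capacity 2k gives 9 passes while capacity 1.5k gives 18, matching the 1/α scaling of the amortized cost.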

About the Current topcollector

A binary heap is used, and each insertion into the heap takes O(log k). From #283 (comment): "If hits are shuffled ... The overall number of inserts is asymptotically equivalent to k log(n)".

Linear time selection algorithm

There is the order-stat crate, but maybe we need a better Floyd–Rivest implementation that handles duplicate elements efficiently.

@w32zhong

w32zhong commented Jun 29, 2018

the probability of having to insert doc n in the heap is k/n.
From #283 (comment)

I think some of the hits will replace old ones already in the heap, right? The number of insertions can be significantly larger than that.

Insertion is the dominant operation. If we look at the complexity assuming the first k hits all need to be inserted, then as @kcm1700 stated, the cost per insertion is O(log k) vs. amortized O(1): in the min-heap case, k insertions cost O(k log k), while in the proposed linear-selection approach the cost is k + (k + k/2 + k/4 + ...) = k + sum_{d=0}^{log k} k/2^d = O(k).

But I suspect mainstream search engines still use a binary heap for cache efficiency, since a heap has better spatial locality.

@kcm1700
Author

kcm1700 commented Jun 30, 2018

Right, the number of insertions can be significantly larger. The comment assumes documents are shuffled randomly, which is not necessarily true. But if we assume that, the probability of having to insert the i-th document into the heap is k/i. Then E(# of insertions) = sum_{i=k}^{n} k/i = O(k log n). Therefore, on average, the time complexity is O(k log n log k). In the worst case, the time complexity is O(n log k).
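The order dependence is easy to demonstrate by counting heap insertions directly (hypothetical `count_heap_inserts`, using a min-heap of the current top k via `Reverse`). Scores arriving in ascending order force an insertion on every hit (the worst case), while descending order stops after the first k:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Count how many hits actually enter a size-k min-heap,
/// i.e. how often the O(log k) insertion cost is paid.
fn count_heap_inserts(scores: &[u32], k: usize) -> usize {
    let mut heap: BinaryHeap<Reverse<u32>> = BinaryHeap::new();
    let mut inserts = 0;
    for &s in scores {
        if heap.len() < k {
            heap.push(Reverse(s));
            inserts += 1;
        } else if s > heap.peek().unwrap().0 {
            // New hit beats the current k-th best: replace it.
            heap.pop();
            heap.push(Reverse(s));
            inserts += 1;
        }
    }
    inserts
}
```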

I think linear selection algorithms are usually cache friendly, so I guess it shouldn't really matter.

I guess most tools use a binary heap because the performance of this part is not critical.

@w32zhong

My major concern with linear selection is that as the array grows longer and longer (the case where it performs better in theory), the entire array no longer fits into cache, whereas with a min-heap, an insertion sifts up level by level with better locality. But this is purely based on my perception; it would be interesting to run both and compare them.

@fulmicoton
Collaborator

You misunderstood @kcm1700's solution. He suggests working off a Vec of capacity 2k.

You add elements to it (probably only if they pass the threshold). Once the array reaches its capacity, apply linear selection to bring it back to a len of k (if a threshold is used, compute the new threshold at that point as well).

@fulmicoton
Collaborator

@kcm1700 Actually that solution could probably be adapted to handle pagination, i.e. displaying from page 10 to page 20.
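One way this could work (a sketch with a hypothetical `page` function, assuming `from < to <= hits.len()`): run the collector with k = `to`, then partition once more and slice out the requested rank range:

```rust
/// Return hits ranked [from, to) in descending order.
fn page(mut hits: Vec<u32>, from: usize, to: usize) -> Vec<u32> {
    // Partition so indices 0..to hold the `to` best scores: O(n).
    hits.select_nth_unstable_by(to - 1, |a, b| b.cmp(a));
    hits.truncate(to);
    // Only the requested window needs a full sort: O(to log to).
    hits.sort_by(|a, b| b.cmp(a));
    hits[from..].to_vec()
}
```

The buffer cost grows with `to` rather than with the page size, which is the usual trade-off for deep pagination.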

@fulmicoton
Collaborator

If someone wants to pick that and bench it properly that would be awesome.

@kcm1700
Author

kcm1700 commented Jan 25, 2019

@fulmicoton I think you are right, we can probably modify the algorithm to handle pagination. I had not thought of that possibility. Good call!

I guess we could handle pagination with a typical heap data structure too. Maybe worth trying that first.

@fulmicoton
Collaborator

fulmicoton commented Jan 25, 2019 via email

@fulmicoton
Collaborator

fulmicoton commented Jan 25, 2019 via email

@fulmicoton
Collaborator

It has been implemented already.
