Top k Collector with Linear time selection #298
I think some of the hits will replace older ones in the heap, right? The number of insertions can be significantly larger than k. Insertion is the dominant operation: if we look at the complexity assuming all of the first k hits need to be inserted, then, as @kcm1700 stated, insertion is O(log k) vs amortized O(1). (In the min-heap case, k insertions cost O(k log k); in the proposed linear-selection approach, the cost is k + (k + k/2 + k/4 + ...) = k + sum_{d=0}^{log k} k/2^d = O(k).) But I suspect mainstream search engines still use a binary heap for cache efficiency, because the heap has better spatial locality.
---
Right, the number of insertions can be significantly larger. The comment assumes documents arrive in random order, which is not necessarily true. But if we assume that, the probability that the i-th document must be inserted into the heap is k / i. Then E(# of insertions) = sum_{i=k}^{n} k / i = O(k log n). Therefore, on average, the heap time complexity is O(k log n log k). In the worst case, it is O(n log k). I think linear selection algorithms are usually cache friendly, so I guess it shouldn't really matter. I suspect most tools use a binary heap simply because the performance of this part is not critical.
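For intuition, here is a small, self-contained simulation of that estimate (a sketch, not tantivy code; the xorshift PRNG, seed, and function names are made up for illustration). With n = 100000 and k = 10 randomly shuffled documents, the expected number of heap insertions is roughly k (1 + ln(n/k)) ≈ 102, far below n:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Deterministic xorshift64 PRNG so the sketch needs no external crates.
fn xorshift(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

fn count_heap_insertions(n: usize, k: usize, seed: u64) -> usize {
    // Random permutation of scores 0..n via Fisher-Yates.
    let mut docs: Vec<u64> = (0..n as u64).collect();
    let mut state = seed;
    for i in (1..n).rev() {
        let j = (xorshift(&mut state) % (i as u64 + 1)) as usize;
        docs.swap(i, j);
    }
    // Min-heap of the k best scores seen so far; count actual insertions.
    let mut heap: BinaryHeap<Reverse<u64>> = BinaryHeap::with_capacity(k);
    let mut insertions = 0;
    for score in docs {
        if heap.len() < k {
            heap.push(Reverse(score));
            insertions += 1;
        } else if heap.peek().unwrap().0 < score {
            heap.pop();
            heap.push(Reverse(score));
            insertions += 1;
        }
    }
    insertions
}

fn main() {
    let insertions = count_heap_insertions(100_000, 10, 42);
    println!("heap insertions for n=100000, k=10: {}", insertions);
}
```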
---
My major concern with linear selection is that as the array gets reallocated longer and longer (the case where it performs better in theory), the entire array no longer fits in cache, whereas with a min-heap an insertion sifts up level by level with better locality. But this is purely based on my intuition; it would be interesting to run both and compare.
---
You misunderstood @kcm1700's solution. He suggests working off a Vec with capacity larger than k (e.g. 2k). You add elements to it (possibly only if they beat the current threshold). Once the array reaches its capacity, apply linear selection to bring it back to a length of k (if a threshold is used, compute the new threshold as well).
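As a sketch of that idea (hypothetical code, not tantivy's actual collector; the `TopK` type and plain `f64` scores are made up for illustration), using the standard library's `select_nth_unstable_by` as the linear-time selection step:

```rust
// Buffered top-k: collect scores into a Vec of capacity 2k. When the
// buffer fills up, `select_nth_unstable_by` (introselect, expected
// O(len)) partitions it so the k best scores come first, and the rest
// is truncated. Assumes k >= 1 and finite (non-NaN) scores.
struct TopK {
    k: usize,
    buffer: Vec<f64>,
}

impl TopK {
    fn new(k: usize) -> Self {
        TopK { k, buffer: Vec::with_capacity(2 * k) }
    }

    fn collect(&mut self, score: f64) {
        if self.buffer.len() == 2 * self.k {
            // Linear-time selection: afterwards, the k largest scores
            // occupy buffer[..k] (in arbitrary order).
            self.buffer
                .select_nth_unstable_by(self.k - 1, |a, b| b.partial_cmp(a).unwrap());
            self.buffer.truncate(self.k);
        }
        self.buffer.push(score);
    }

    fn harvest(mut self) -> Vec<f64> {
        // Final sort of at most 2k elements: O(k log k).
        self.buffer.sort_by(|a, b| b.partial_cmp(a).unwrap());
        self.buffer.truncate(self.k);
        self.buffer
    }
}

fn main() {
    let mut collector = TopK::new(3);
    for score in [1.0, 5.0, 2.0, 9.0, 3.0, 7.0, 4.0, 8.0, 6.0, 0.5] {
        collector.collect(score);
    }
    println!("top 3: {:?}", collector.harvest());
}
```

Note that `collect()` on the fast path is just a bounds check and a push, which is the whole appeal compared to a heap's sift-up.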
---
@kcm1700 Actually, that solution could probably be adapted to handle pagination.
---
If someone wants to pick that up and benchmark it properly, that would be awesome.
---
@fulmicoton I think you are right; we can probably modify the algorithm to handle pagination. I had not thought of that possibility. Good call.

---

I guess we can handle pagination with a typical heap data structure too. Maybe worth trying that first.
---
Unsure actually. I was triaging tickets, so I didn't have time to think too much about it.
---
Only the last operation would be beneficial.
---
It has been implemented already.
From #283 (comment)
Let's experiment with the idea. The algorithm might work better for large k.
Time Complexity

If capacity = 2k:
- `collect()` runs in amortized O(1).
- `score_docs()` runs in O(k log k).

Of course we can try different capacities, but it is not that interesting. If capacity = (1+α)k:
- linear time selection happens at most (n − (1+α)k) / (αk) times,
- `collect()` runs in amortized O(1 + α⁻¹),
- `score_docs()` runs in O(k log k).

About the current top collector

A binary heap is used. Each insertion into the heap takes O(log k). From #283 (comment): "If hits are shuffled ... The overall number of inserts is asymptotically equivalent to k log(n)".
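For reference, the amortized O(1 + α⁻¹) bound for capacity (1+α)k follows from a standard amortization argument, sketched here:

```latex
% Between two selection passes, exactly \alpha k new documents are
% collected, and each pass over a buffer of (1+\alpha)k elements
% costs O((1+\alpha)k). Amortized cost per collect():
\frac{O\big((1+\alpha)k\big)}{\alpha k}
  = O\!\left(\frac{1+\alpha}{\alpha}\right)
  = O\big(1 + \alpha^{-1}\big)
```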
Linear time selection algorithm

There is the order-stat crate, but maybe we need a better Floyd-Rivest implementation that handles duplicate elements efficiently.
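For illustration, one way to make selection robust to duplicates is a quickselect with a three-way (Dutch national flag) partition, so runs of pivot-equal elements are settled in a single pass. This is a hypothetical sketch, not the order-stat crate's code and not a full Floyd-Rivest implementation:

```rust
// Quickselect with three-way partitioning: after the call, data[n]
// holds the n-th smallest element (0-indexed), with smaller elements
// before it and larger ones after it. Arrays with many duplicates
// terminate quickly because the whole pivot-equal run is excluded at
// each step.
fn select_nth(data: &mut [u64], n: usize) {
    let (mut lo, mut hi) = (0usize, data.len());
    while hi - lo > 1 {
        // Median-of-three pivot to avoid worst cases on sorted input.
        let mid = lo + (hi - lo) / 2;
        let pivot = {
            let (a, b, c) = (data[lo], data[mid], data[hi - 1]);
            a.max(b).min(a.min(b).max(c)) // median of a, b, c
        };
        // Dutch-national-flag partition: [lo..lt) < pivot,
        // [lt..i) == pivot, [gt..hi) > pivot.
        let (mut lt, mut i, mut gt) = (lo, lo, hi);
        while i < gt {
            if data[i] < pivot {
                data.swap(lt, i);
                lt += 1;
                i += 1;
            } else if data[i] > pivot {
                gt -= 1;
                data.swap(i, gt);
            } else {
                i += 1;
            }
        }
        if n < lt {
            hi = lt;
        } else if n >= gt {
            lo = gt;
        } else {
            return; // n falls inside the run of pivot-equal elements
        }
    }
}

fn main() {
    let mut v = vec![5u64, 3, 9, 3, 7, 1, 3, 9, 2, 8];
    select_nth(&mut v, 4);
    println!("5th smallest: {}", v[4]); // sorted order: 1 2 3 3 3 5 7 8 9 9
}
```

In practice the standard library's `select_nth_unstable` (a pattern-defeating introselect) may already be good enough, so benchmarking it against a custom implementation would be the first step.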