-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve performance changes (draft) #18
Conversation
3aac183
to
81fc550
Compare
fmt::{Debug, Formatter}, | ||
hash::Hash, | ||
usize, | ||
}; | ||
|
||
use crate::{FieldAccessor, Filter, Tokenizer}; | ||
use crate::{FieldAccessor, Tokenizer}; | ||
use hashbrown::{HashMap, HashSet}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's quite a significant change in performance, I guess most of it is the switch to hashbrown
?
I can imagine a small gain from removing the Filter
calls but not that significant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Most is hashbrown . A small portion seem to be from the Cow change in the add_document
also, but much less significant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! I think removing Filter
is a good decision, less is more. It seems like hashbrown
gives most of the performance improvement, ~25% speedup is a big deal 🙌
I was playing our with the library so I created a few changes I could possibly split into multiple PRs.
Would like to hear you thoughts about this
Changes for improved performance
add_document
useCow
for hash keys.hashbrown
Hashmap instead ofstd::collections
. Faster for small hash keys. Though does not provide the same level of HashDoS resistanceTermData
query_terms
property is replaced withquery_terms_len
to prevent unnecessary copies of all query terms during query (not benchmarked yet) (Breaking change for theScoreCalculator
trait)Bench results on my computer
add_100k_docs
Master
301.19 ms
This branch
221.24 ms
API change ideas
Filter
is removed, since theTokenizer
can just be seen as a preprocessing step to indexation and query and can do anything that theFilter
does. One argument less to worry about. (Breaking change)