High Performance Indexing #3607
Open
Labels: enhancement (New feature or request)
Description
This doc outlines what an optimal pipeline for parsing JSON and passing it to tantivy could look like. JSON handling currently costs around 20% to 30% of total CPU time when ingesting data. Additional hidden costs from nested JSON and poor cache locality are hard to gauge.
Note: We could easily parallelize JSON parsing to increase throughput, but that CPU time could instead be saved or spent on higher compression.
Steps
- Retrieve the unparsed JSON as a `String`.
- Validate UTF-8 once on the unparsed JSON. (Currently this is done for every `String`.)
- Find the positions of (potentially escaped) strings (like in simd_json).
- Unescape the strings in place and update the positions. This can be done on the original `String`, since an escaped JSON string is never shorter than its unescaped form. E.g. unescape in place and write zeros to keep the string length the same:
  `key:"\"My Quote\""` => `key:""My Quote""00`
- Now we can parse the JSON and reference everything with `&str` (similar to serde_json_borrow). As an added bonus, this will increase cache locality.
  - Consider parsing floats into a decimal representation.
- Nested data in JSON is usually allocated in a `BTreeMap`, but this is unnecessary work since we want it flattened anyway. Therefore we can pre-flatten it into a `Vec<(&str, tantivy::Value)>` (related to "Big indexing refactoring", tantivy#2015). We'll need two groups:
  - Root Level Attributes: `&str => Value`
  - Nested Attributes: JSON path (e.g. `"json.vals.blub"`, but using the tantivy format with `\0` as a separator) `String => Value`. Maybe `Arc<String> => Value`, with the `Arc<String>` looked up from the JSON path in a `HashMap<&[&[u8]], Arc<String>>`.
- We would have something that looks similar to:

      use std::sync::Arc;

      struct QuickwitDoc {
          unparsed: String,
          // references slices of `unparsed`, cast to a static lifetime
          root: Vec<(&'static str, Value<'static>)>,
          // references slices of `unparsed`, cast to a static lifetime
          nested: Vec<(Arc<String>, Value<'static>)>,
      }

- Implement the document trait on `QuickwitDoc` to avoid conversion into `Document` ("Document as trait", tantivy#1352).
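
To make the in-place unescaping step more concrete, here is a minimal sketch using only the standard library. The function name `unescape_in_place` and the hard-coded string offsets are hypothetical; a real pipeline would get the positions from the string-finding step. The sketch handles only the single-character escapes and rejects `\uXXXX`. It also writes the zero padding directly after the unescaped bytes (before the closing quote) rather than after the quote as in the example above. Either way the buffer length and all other offsets stay unchanged.

```rust
/// Unescapes the JSON string body at `buf[start..end]` (the bytes between the
/// quotes) in place and fills the freed tail with zero bytes, so the buffer
/// length and every other offset stay valid. Returns the unescaped length.
/// Hypothetical helper: only single-character escapes are supported, and on
/// `None` the span may already be partially rewritten.
fn unescape_in_place(buf: &mut [u8], start: usize, end: usize) -> Option<usize> {
    let mut read = start;
    let mut write = start;
    while read < end {
        let byte = buf[read];
        if byte == b'\\' {
            read += 1;
            if read >= end {
                return None; // dangling backslash
            }
            buf[write] = match buf[read] {
                b'"' => b'"',
                b'\\' => b'\\',
                b'/' => b'/',
                b'b' => 0x08,
                b'f' => 0x0C,
                b'n' => b'\n',
                b'r' => b'\r',
                b't' => b'\t',
                _ => return None, // includes `\uXXXX`, left out of this sketch
            };
        } else {
            // The write cursor never overtakes the read cursor, so plain
            // bytes (including multi-byte UTF-8 sequences) can be copied down.
            buf[write] = byte;
        }
        read += 1;
        write += 1;
    }
    // Pad with zeros so the overall length stays the same.
    for slot in &mut buf[write..end] {
        *slot = 0;
    }
    Some(write - start)
}

fn main() {
    // Raw input: key:"\"My Quote\""
    // A `String` is already valid UTF-8, so validation happened exactly once.
    let raw = String::from(r#"key:"\"My Quote\"""#);
    let mut buf = raw.into_bytes();

    // Assumed output of the position-finding step: the string body starts
    // right after the opening quote and ends at the closing quote.
    let (start, end) = (5, buf.len() - 1);

    let len = unescape_in_place(&mut buf, start, end).expect("unsupported escape");
    // The sketch re-checks UTF-8 here for safety; the escapes it handles
    // only map ASCII to ASCII, so the single upfront validation would suffice.
    let value = std::str::from_utf8(&buf[start..start + len]).expect("valid UTF-8");
    println!("{value}");
}
```

The updated position is simply `(start, len)`, which is what a borrowed `&str` value would later point at.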
Notes
I'm not sure how array<> is handled in quickwit currently.
QuickwitDoc should probably contain an UnorderedId from quickwit-oss/tantivy#2015
Side Note: Indexing throughput can already be considered high performance as of now with ~30Mb/s
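
The pre-flattening into root-level and nested attributes described in the steps could be sketched roughly as follows. Everything here is an assumption for illustration: `Json` stands in for a borrowed parse tree (in the spirit of serde_json_borrow), `Leaf` stands in for `tantivy::Value`, and `FlatDoc`, `flatten` and `push_leaf` are hypothetical names. The path cache is keyed by borrowed path segments, so it only works while the documents share one lifetime (e.g. one batch buffer). Arrays are left out, since their handling is an open question in the notes above.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical stand-in for a borrowed JSON tree; the root is assumed to be an object.
enum Json<'a> {
    Str(&'a str),
    Num(f64),
    Object(Vec<(&'a str, Json<'a>)>),
}

/// Hypothetical stand-in for `tantivy::Value`.
#[derive(Debug, PartialEq)]
enum Leaf<'a> {
    Str(&'a str),
    Num(f64),
}

/// Flattened document: root attributes keep their borrowed key, nested
/// attributes carry a shared, `\0`-joined JSON path.
#[derive(Default)]
struct FlatDoc<'a> {
    root: Vec<(&'a str, Leaf<'a>)>,
    nested: Vec<(Arc<String>, Leaf<'a>)>,
}

/// Cache from path segments to the joined path, so the joined `String` is
/// built once per distinct path and then shared via `Arc`.
type PathCache<'a> = HashMap<Vec<&'a str>, Arc<String>>;

fn push_leaf<'a>(leaf: Leaf<'a>, path: &[&'a str], cache: &mut PathCache<'a>, out: &mut FlatDoc<'a>) {
    // A single segment means the value sits directly at the root level.
    if let [key] = path {
        out.root.push((*key, leaf));
        return;
    }
    if let Some(hit) = cache.get(path) {
        out.nested.push((Arc::clone(hit), leaf));
        return;
    }
    // Separator follows the issue's description of the tantivy path format.
    let joined = Arc::new(path.join("\0"));
    cache.insert(path.to_vec(), Arc::clone(&joined));
    out.nested.push((joined, leaf));
}

fn flatten<'a>(node: &Json<'a>, path: &mut Vec<&'a str>, cache: &mut PathCache<'a>, out: &mut FlatDoc<'a>) {
    match node {
        Json::Object(fields) => {
            for (key, child) in fields {
                path.push(*key);
                flatten(child, path, cache, out);
                path.pop();
            }
        }
        Json::Str(s) => push_leaf(Leaf::Str(*s), path.as_slice(), cache, out),
        Json::Num(n) => push_leaf(Leaf::Num(*n), path.as_slice(), cache, out),
    }
}

fn main() {
    let doc = Json::Object(vec![
        ("title", Json::Str("hello")),
        ("json", Json::Object(vec![
            ("vals", Json::Object(vec![("blub", Json::Num(1.5))])),
        ])),
    ]);
    let mut cache = PathCache::new();
    let mut flat = FlatDoc::default();
    flatten(&doc, &mut Vec::new(), &mut cache, &mut flat);
    println!("root: {:?}", flat.root);
    println!("nested: {:?}", flat.nested);
}
```

No intermediate `BTreeMap` is built for nested objects; the only per-path allocation is the joined path string, and it is reused whenever the same path shows up again.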