High Performance Indexing

This doc outlines what an optimal pipeline for parsing JSON and passing it to tantivy could look like. Currently, JSON handling costs around 20-30% of total CPU time when ingesting data. Additional hidden costs from nested JSON and poor cache locality are hard to gauge.

Note: We could easily parallelize JSON parsing to increase throughput, but that CPU time could instead be saved or spent on higher compression.

## Steps

1. Retrieve unparsed JSON as `String`
2. Validate UTF-8 once on the unparsed JSON. (Currently this is done for every `String`.)
3. Find positions of (potentially escaped) strings (like in [simd_json](https://github.com/simd-lite/simd-json/blob/main/src/sse42/stage1.rs#L79))
4. Unescape the strings in place and update the positions. This can be done on the original `String`, since an escaped JSON string is never shorter than its unescaped form. E.g. unescape in place and write zeros to keep the string length the same:
`key:"\"My Quote\""` => `key:""My Quote""00`
5. Now we can parse the JSON and reference everything with `&str` (similar to [serde_json_borrow](https://github.com/pseitz/serde_json_borrow)). As an added bonus, this will increase cache locality.
5.1. Consider parsing into decimal for floats
6. Nested data in JSON is usually allocated in a `BTreeMap`, but this is unnecessary work since we want it flattened anyway. Therefore we can pre-flatten it into a `Vec<(&str, tantivy::Value)>` (related to https://github.com/quickwit-oss/tantivy/issues/2015). We'll need two groups:
    * Root Level Attributes: `&str => Value`
    * Nested Attributes: JSON path (e.g. `"json.vals.blub"`, but use tantivy format with \0 as a separator) `String => Value`. Maybe `Arc<String> => Value` from a JSON path `HashMap<&[&[u8]], Arc<String>>`
7. We would have something that looks similar to:
```rust
use std::sync::Arc;

// `Value<'a>` stands for a borrowed value type that can hold a `&'a str`.
struct QuickwitDoc {
    unparsed: String,
    // References slices of `unparsed`, with the lifetime cast to 'static.
    root: Vec<(&'static str, Value<'static>)>,
    // References slices of `unparsed`, with the lifetime cast to 'static.
    nested: Vec<(Arc<String>, Value<'static>)>,
}
```
8. Implement the document trait on `QuickwitDoc` to avoid the conversion into `Document` (https://github.com/quickwit-oss/tantivy/issues/1352).
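
To make step 4 more concrete, here is a minimal sketch of in-place unescaping on a byte buffer. The function names `unescape_in_place` and `parse_hex4` are hypothetical and not part of tantivy or simd_json. As a simplifying assumption, the zero padding is written at the end of the string's own byte range (before the closing quote) instead of shifting the rest of the document, so only the length of that one string needs to be updated.

```rust
/// Unescapes the JSON string content in `buf[start..end]` (the bytes between
/// the quotes) in place and returns the unescaped length. The freed tail of
/// the range is filled with zero bytes, so the buffer length is unchanged.
/// Returns `None` on a malformed escape; the range may then be partially
/// rewritten.
fn unescape_in_place(buf: &mut [u8], start: usize, end: usize) -> Option<usize> {
    let (mut read, mut write) = (start, start);
    while read < end {
        if buf[read] != b'\\' {
            // Plain byte: move it down to the write cursor.
            buf[write] = buf[read];
            read += 1;
            write += 1;
            continue;
        }
        if read + 1 >= end {
            return None; // dangling backslash
        }
        let escape = buf[read + 1];
        read += 2;
        let unescaped = match escape {
            b'"' | b'\\' | b'/' => escape,
            b'b' => 0x08,
            b'f' => 0x0C,
            b'n' => b'\n',
            b'r' => b'\r',
            b't' => b'\t',
            b'u' => {
                let mut code = parse_hex4(&*buf, read, end)?;
                read += 4;
                if (0xD800..0xDC00).contains(&code) {
                    // High surrogate: a `\uXXXX` low surrogate must follow.
                    if read + 2 > end || buf[read] != b'\\' || buf[read + 1] != b'u' {
                        return None;
                    }
                    let low = parse_hex4(&*buf, read + 2, end)?;
                    if !(0xDC00..0xE000).contains(&low) {
                        return None;
                    }
                    read += 6;
                    code = 0x10000 + ((code - 0xD800) << 10) + (low - 0xDC00);
                }
                // Rejects lone low surrogates.
                let ch = char::from_u32(code)?;
                let mut utf8 = [0u8; 4];
                let encoded = ch.encode_utf8(&mut utf8).as_bytes();
                // The UTF-8 form is shorter than the escape, so `write`
                // never overtakes `read`.
                buf[write..write + encoded.len()].copy_from_slice(encoded);
                write += encoded.len();
                continue;
            }
            _ => return None,
        };
        buf[write] = unescaped;
        write += 1;
    }
    buf[write..end].fill(0);
    Some(write - start)
}

/// Parses exactly four hex digits at `buf[pos..pos + 4]`, staying below `end`.
fn parse_hex4(buf: &[u8], pos: usize, end: usize) -> Option<u32> {
    if pos + 4 > end {
        return None;
    }
    let mut code = 0u32;
    for &byte in &buf[pos..pos + 4] {
        code = code * 16 + (byte as char).to_digit(16)?;
    }
    Some(code)
}
```

Called on the bytes of `key:"\"My Quote\""` with the range between the outer quotes, this leaves `"My Quote"` followed by two zero bytes in that range. If the input was valid UTF-8, the buffer should still be valid afterwards, so the single validation from step 2 would not need to be repeated.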

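The pre-flattening from step 6 could look roughly like the sketch below. Everything here is an assumption made for illustration: `Json` is a stand-in for a borrowed parse result, `FlatDoc` and `PathInterner` are hypothetical names, the separator follows the description above and should be checked against the constant tantivy actually uses, and array elements simply share the path of their array since array handling is still open.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Separator as described in the text above (assumption, not read from tantivy).
const PATH_SEP: char = '\0';

/// Stand-in for a borrowed JSON value; strings point into the unparsed input.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Json<'a> {
    Null,
    Bool(bool),
    Number(f64),
    Str(&'a str),
    Array(Vec<Json<'a>>),
    Object(Vec<(&'a str, Json<'a>)>),
}

/// Flattened document: both lists only ever hold scalar `Json` values.
#[derive(Debug, Default)]
struct FlatDoc<'a> {
    root: Vec<(&'a str, Json<'a>)>,
    nested: Vec<(Arc<String>, Json<'a>)>,
}

/// Shares one `Arc<String>` per distinct JSON path, across documents.
#[derive(Default)]
struct PathInterner {
    paths: HashMap<String, Arc<String>>,
}

impl PathInterner {
    fn intern(&mut self, path: &str) -> Arc<String> {
        if let Some(existing) = self.paths.get(path) {
            return Arc::clone(existing);
        }
        let shared = Arc::new(path.to_string());
        self.paths.insert(path.to_string(), Arc::clone(&shared));
        shared
    }
}

/// Splits top-level fields into root scalars and flattened nested values.
fn flatten<'a>(fields: Vec<(&'a str, Json<'a>)>, interner: &mut PathInterner) -> FlatDoc<'a> {
    let mut doc = FlatDoc::default();
    for (key, value) in fields {
        if matches!(value, Json::Object(_) | Json::Array(_)) {
            let mut path = String::from(key);
            visit(value, &mut path, interner, &mut doc);
        } else {
            // Root scalars keep their borrowed key, no path allocation needed.
            doc.root.push((key, value));
        }
    }
    doc
}

/// Depth-first walk that reuses one path buffer for the whole subtree.
fn visit<'a>(value: Json<'a>, path: &mut String, interner: &mut PathInterner, doc: &mut FlatDoc<'a>) {
    match value {
        Json::Object(fields) => {
            for (key, child) in fields {
                let parent_len = path.len();
                path.push(PATH_SEP);
                path.push_str(key);
                visit(child, path, interner, doc);
                path.truncate(parent_len);
            }
        }
        Json::Array(items) => {
            for item in items {
                visit(item, path, interner, doc);
            }
        }
        scalar => doc.nested.push((interner.intern(path.as_str()), scalar)),
    }
}
```

Reusing the same `PathInterner` for many documents means each distinct path is allocated once, which is the motivation for the `Arc<String>` keys. A lookup keyed by path segments, as suggested above, would additionally avoid building the path string on cache hits.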

### Notes 
I'm not sure how `array<>` is handled in quickwit currently.
`QuickwitDoc` should probably contain an `UnorderedId` from https://github.com/quickwit-oss/tantivy/issues/2015
Side note: at ~30Mb/s, indexing throughput can already be considered high performance.
